SitNews - Stories in the News - Ketchikan, Alaska

Viewpoints

An accurate, technical explanation of the PFD data loss
By Norm Snyder

 

March 22, 2007
Thursday AM


You may have read a story about a Permanent Fund Dividend (PFD) data loss, following is an accurate, technical explanation of the occurrence.

The PFD Division uses a pair of clustered Dell servers running MS SQL 2005 connected to Dell EMC storage arrays with approximately 3 Terabytes (Tb) of usable space. The PFD database consists of approximately 1.5 Tb of data and images with an additional 300-500Mb of data and images added each year. The PFD Division uses a pair of clustered Dell servers running MS SQL 2005 connected to Dell EMC storage arrays with approximately 3 Terabytes (Tb) of usable space. The PFD database consists of approximately 1.5 Tb of data and images with an additional 300-500Mb of data and images added each year.

The majority of the PFD database consists of paper documents that have been scanned and added to the database as digital images. Since the space requirement for images is quite large, the image portion of the database consists of a partitioned table, which is divided into file groups, each year being a separate partition in it's own file group. The active year is a read/write file group and once closed out, it is marked as read only. Then the entire database is backed up using filegroup (partial) backups, followed by a transaction log backup. These database backup files are then backed up to tape and safely stored at several locations. The partition table is a new feature of MS SQL 2005, making it much easier to backup large databases. Since the data in read-only file groups does not change, it does not have to be backed up as frequently as the information kept in the active read-write file groups. This flexibility also adds another layer of complexity to file management. As new active read/write file groups are added to a database, it is imperative that they are also backed up at the same time as the other active file groups, along with transactions logs, otherwise you will be unable to completely restore and bring online the database.

In June of 2006 hard errors were reported by one of the disk drive storage array processors. Our Network Specialist established a service call with the hardware vendor, and was advised to run a Background Verify as the last option to repair inconsistencies in the Raid stripe set. This failed to correct the problem. The remaining option left was to unbind then bind the two LUN's that were corrupted. First the Specialist moved the database filegroups stored on the corrupted LUN's to another LUN in preparation for the unbind and bind (unbinding and rebinding destroys all data on the LUN). Unfortunately, it is difficult to correlate which LUN number corresponds to which drive letter designation on the server, and some of the files were mistakenly moved to the LUN's which were to be rebound. The Spec was working on a remote desktop session with a Dell storage specialist at the time of the unbind and rebind, and the Dell specialist also missed this error. After the rebind it was discovered that one of the LUN's rebound had contained data files and also the SQL backup files.

The Network Specialist then attempted to restore the database backups from tape. Unfortunately one critical file, the primary filegroup (MDF) for the PFD Database had inadvertently not been selected to be written to tape during the normal backup process. Without a current backup of the primary .MDF file, the database could not be restored and brought online.

Our IT Staff worked over the weekend restoring all of the Read Only historical file groups for the years 2000 thru 2005, using an older backup of the Primary (MDF) file, but was unable to successfully restore the 2006 filegroup. Even though there was a current backup for 2006 and transaction log backups, a matching backup of the Primary file was also needed.

While little actual data was lost, what was lost was 800,000 paper documents that had been received during 2006 and scanned into the system. It must be made perfectly clear that the loss of these images was not the fault of the backup software, the tape library, storage array, or any other components. It was strictly human error and the consequences of not placing a check box next to the .MDF file, instructing the tape backup software to place the file onto tape.

The PFD division retains all paper documents received for at least one full year. To recover, the database was placed online with an active (2006) file group containing no data. Four seasonal employees returned to work in the summer of 2006 and for a period of 2 1/2 months, the paper documents were rescanned and the active file group was repopulated with the newly scanned images. The images that could be OCR'd were automatically linked with their corresponding data elements. Those that could not be OCR'd were manually linked. Once again, no data or images were permanently lost in the process.

Our IT Staff and PFD application programmers are now much more familiar with what is required to successfully restore this database from filegroup backups, some of which is new in MS SQL 2005.

We have since added additional disk storage to the disk drive array and now have a total of 7 Tb usable space making it much easier to restore databases and perform regular tests. We now have a formal written backup plan where active file groups and logs are backed up daily and full backups which include all file groups are performed quarterly. A team of IT and PFD Programming staff now review and certify backup logs daily and review all database properties, backup procedures, and scripts monthly. Backup cycle rotations are reviewed and each scheduled backup is now certified by three people as being complete. IT and PFD Programming staff also perform end to end backup and restore drills quarterly, verifying that the entire database can be restored from tapes used in the Quarterly full backups. Future plans include the development of an offsite file replication system to be located in our Anchorage facility.

Norm Snyder
IT Manager
Alaska Department of Revenue


Note: Comments published on Viewpoints are the opinions of the writer
and do not necessarily reflect the opinions of Sitnews.



Publish A Letter on SitNews
        Read Letters/Opinions

Contact the Editor

SitNews ©2007
Stories In The News
Ketchikan, Alaska