In the first small company I worked for, the husband and wife business owners took tape backups of our software home with them from the office every Friday night ... so they'd be able to resurrect their business in the event of an office fire or something. I doubt they ever did any serious disaster recovery testing though.
The only reason you do backups .. is so that you can do a restore.
And if you haven't tested your backup/restore procedure, then it's a little like Schroedinger's Cat. You don't know if you have a backup .. until you actually successfully do a restore.
Alex / talexb / Toronto
Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
> if you haven't tested your backup/restore procedure, then it's a little like Schroedinger's Cat. You don't know if you have a backup .. until you actually successfully do a restore.
Hey, it's war story time again! ;-)
During the last few days of my final year at university, a little student-managed server in my favorite lab lost a lot of data. I don't remember the exact details; I think an entire hard disk died. The server was an old tower PC, built around something like a Pentium II, with no redundancy at all: all consumer parts, no server parts, filled with old hard disks, and a big fan tied to the front of the case with old wires. I guess all of its parts had been picked out of the dumpster. It ran Linux, probably an early version of Debian, and it had a SCSI tape streamer. Actually, two streamers: one online, one "offline" in the spare parts bin.
Someone had set up a cron job that used tar to write a backup to tape. Great idea; that's what tar was designed for. One of the students must have swapped the tapes each morning. Larger disks were added, and one day the tape was full. The backup failed. Some "clever" guy must have found tar's -z option to compress the data with gzip, and added that option to the cron job. The backup worked again, and the tapes had some room again. Nobody verified or tested the backup.
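For anyone who hasn't seen such a setup: the job probably looked roughly like the sketch below. The tape device name and the backed-up paths are my guesses, not the lab's actual configuration.

    # /etc/crontab sketch: nightly dump of the data disks to the SCSI streamer
    # (device name and paths assumed)
    30 2 * * * root tar -cf /dev/tape /home /srv/data
    # the later, ill-advised change: -z pipes everything through gzip first
    # 30 2 * * * root tar -czf /dev/tape /home /srv/data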
Then, data was lost. Restoring the backups failed. The tapes were worn out and had several read errors, and the streamers were dirty as hell. Plain tar can handle tapes with errors: it uses fixed-size blocks, and if a block is not readable, it can at least find the next file on the tape and continue from there. That way you won't get all of your data back, but probably a lot of it. Remember the -z option? The cron job wrote a gzip-compressed byte stream to the tapes. No more fixed blocks, and gzip absolutely does not like I/O errors while decompressing a data stream. All the tape-handling advantages of tar were lost.
In the end, I had a lot of free time that day, so I could help recover data from the tapes. We found another large, empty hard disk, and used something like dd if=/dev/tape conv=noerror of=/mnt/tmpdisk/backup.tar.gz to get a damaged but readable compressed tape archive (the full pipeline is recapped at the end of this post). It could be decompressed, at least partially, and tar was then able to extract a lot of files. Swapping the streamers allowed us to read some more data from the current tape. The other tape could also be read partially, and a few more, though older, files were recovered. I left sorting out old and new, damaged and sane files, and copying them back to the replacement disk, to the admin, and told him to fix a few things:
- get rid of the -z flag to tar in the cron job, NOW
- get new tapes, preferably longer tapes
- discard the old, worn-out tapes
- get a cleaning tape
- clean up both streamers
- verify the archive on tape after each backup (see the sketch right after this list)
- preferably, get another junk PC, connect the second streamer to that PC, and use that PC to actually test data recovery
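A minimal sketch of the fixed job plus a verification pass might look like this; again, the device name and paths are assumptions on my part:

    # back up uncompressed, so tar's fixed-size blocking can survive bad spots on the tape
    mt -f /dev/tape rewind
    tar -cf /dev/tape /home /srv/data
    # verification pass: rewind and compare the archive against the filesystem
    mt -f /dev/tape rewind
    tar -df /dev/tape

Even a plain listing pass (tar -tvf /dev/tape) reads the whole tape back and will flag dying tapes or dirty heads long before you actually need a restore.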
In the end, a lot of data was recovered: some from the tapes, some from student PCs in the lab, some from old disks in the junk bin. But a lot was lost.
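For the record, the salvage boiled down to roughly this pipeline (mount points and directory names are from memory, so treat them as placeholders):

    # pull a raw image of the tape, carrying on past unreadable blocks instead of aborting
    dd if=/dev/tape of=/mnt/tmpdisk/backup.tar.gz bs=512 conv=noerror,sync
    # decompress as far as gzip gets before the first damaged region stops it
    gzip -dc /mnt/tmpdisk/backup.tar.gz > /mnt/tmpdisk/backup.tar
    # extract whatever tar can still make sense of
    mkdir -p /mnt/tmpdisk/recovered
    tar -xvf /mnt/tmpdisk/backup.tar -C /mnt/tmpdisk/recovered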
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
> serious disaster recovery testing
Careful! "Serious" testing sometimes leads to serious problems.
I don't remember the details anymore, but I heard about a test involving "quick" reboots of some key infrastructure in a data center.
Nobody expected that rebooting too many servers at the same time would overwhelm the peak electrical supply, which in turn led to shutdowns.
Bottom line: Better test the testing!