in reply to Have you ever lost your work?

harangzsolt33, congratulations on a thoughtful and meticulously planned meditation!

Further to the excellent replies you've already received, I thought I'd add a couple of anecdotes.

In the first small company I worked for, the husband and wife business owners took tape backups of our software home with them from the office every Friday night ... so they'd be able to resurrect their business in the event of an office fire or something. I doubt they ever did any serious disaster recovery testing though.

In larger companies I've worked for, auditors enforced regular off-site backups (stored in a bank vault IIRC), along with a formal disaster recovery plan. I'm not sure those plans were designed for the business to survive a nuclear strike on the city that wiped out both the office and the bank vault.

The Arctic World Archive (AWA) is a facility for data preservation, located in the Svalbard archipelago on the island of Spitsbergen, Norway, not far from the Svalbard Global Seed Vault. It contains data of historical and cultural interest from several countries, as well as all of American multinational company GitHub's open source code, in a deeply buried steel vault, with the data storage medium expected to last for 500 to 1,000 years.

-- from Arctic World Archive

If I understand this correctly, code on GitHub would survive even a global nuclear war and the destruction of civilization?

Disaster Recovery References

👁️🍾👍🦟

Re^2: Have you ever lost your work? (disaster recovery)
by talexb (Chancellor) on Jan 09, 2024 at 16:26 UTC
      In the first small company I worked for, the husband and wife business owners took tape backups of our software home with them from the office every Friday night ... so they'd be able to resurrect their business in the event of an office fire or something. I doubt they ever did any serious disaster recovery testing though.

    The only reason you do backups .. is so that you can do a restore.

    And if you haven't tested your backup/restore procedure, then it's a little like Schrödinger's cat. You don't know if you have a backup .. until you actually successfully do a restore.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Many many years ago at $work, "You're only as good as your last backup" became "You're only as good as your last tested restore" :)

      if you haven't tested your backup/restore procedure, then it's a little like Schrödinger's cat. You don't know if you have a backup .. until you actually successfully do a restore.

      Hey, it's war story time again! ;-)

      In the last few days of my final year at university, a little student-managed server in my favorite lab lost a lot of data. I don't remember the exact details; I think it lost an entire hard disk. The server was an old tower PC, built around something like a Pentium II, with no redundancy at all: all consumer parts, no server parts, filled with old hard disks, and a big fan tied to the front of the case with old wires. I guess all of its parts had been picked out of the dumpster. It ran Linux, probably an early version of Debian, and it had a SCSI tape streamer. Actually, two streamers: one online, one "offline" in the spare-parts bin.

      Someone had set up a cron job that used tar to write a backup to tape. Great idea; that's what tar was designed for. One of the students must have swapped the tapes each morning. Larger disks were added, and one day the tape was full and the backup failed. Some "clever" guy must have found tar's -z option to compress the data with gzip, and added that option to the cron job. The backup worked again, and the tapes had some room again. Nobody verified or tested the backup.

      Then, data was lost. Restoring the backups failed. The tapes were worn out and had several read errors, streamers were dirty as hell. tar can handle tapes with errors. It uses fixed-size blocks, and if a block is not readable, it can at least find the next file on tape and continue from there. That way, you won't get all of your data back, but probably a lot of it. Remember the -z option? The cron job wrote a gzip compressed byte stream to the tapes. No more fixed blocks, and gzip absolutely does not like I/O errors while decompressing a compressed data stream. All tape-handling advantages of tar were lost.
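      The difference is easy to re-enact. Here is a small, hypothetical demonstration (all file names invented, GNU tar and gzip assumed): we damage the same stretch of a plain tar archive and of a gzip-compressed one. GNU tar skips past a bad header block and keeps going; gzip gives up at the first bad byte.

```shell
#!/bin/sh
# Hypothetical re-enactment (file names invented) of why -z was fatal:
# damage the same archive twice, once plain, once gzip-compressed.
set -u
work=$(mktemp -d); cd "$work"

# Three small files; in a tar archive each entry starts on a 512-byte block.
yes a | head -c 1000 > a.bin
yes b | head -c 1000 > b.bin
yes c | head -c 1000 > c.bin
tar -cf  plain.tar   a.bin b.bin c.bin
tar -czf comp.tar.gz a.bin b.bin c.bin

# Trash b.bin's header block in the plain archive (block 0 is a.bin's
# header, blocks 1-2 its data, block 3 b.bin's header), and 8 bytes
# inside the compressed stream of the other archive.
dd if=/dev/urandom of=plain.tar   bs=512 seek=3  count=1 conv=notrunc 2>/dev/null
dd if=/dev/urandom of=comp.tar.gz bs=1   seek=30 count=8 conv=notrunc 2>/dev/null

# GNU tar complains, scans forward to the next valid header, and still
# lists a.bin and c.bin; only the damaged entry is lost.
tar -tf plain.tar || echo "(tar reported errors but resynced)"

# gzip cannot resynchronize: everything after the first bad byte is gone.
gzip -t comp.tar.gz || echo "(gzip stream unrecoverable past the damage)"
```

      That scan-forward behavior is exactly what the -z option threw away.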

      In the end, I had a lot of free time that day, so I could help recover data from the tape. We found another large, empty hard disk and used something like dd if=/dev/tape conv=noerror of=/mnt/tmpdisk/backup.tar.gz to get a damaged but readable compressed tape archive. It could be decompressed, at least partially, and tar was then able to extract a lot of files. Swapping the streamers allowed us to read some more data from the current tape. The other tape could also be read partially, and a few more (but older) files were recovered. I left sorting the old from the new and the damaged from the intact files, and copying them back to the replacement disk, to the admin, and told him to fix some things:
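      For the curious, that recovery can be re-enacted with a disk file standing in for /dev/tape (all paths here are invented): dd images the damaged "tape" while ignoring read errors, gzip decompresses as far as it can, and tar then extracts every complete file written before the first error.

```shell
#!/bin/sh
# Re-enactment of the recovery; a file stands in for /dev/tape.
set -u
work=$(mktemp -d); cd "$work"

# Build a "tape": a compressed archive of three files, then damage it
# somewhere in its second half.
mkdir data
echo "early file" > data/first.txt
yes middle | head -c 2000000 > data/middle.bin
echo "late file"  > data/last.txt
tar -czf tape.img -C data first.txt middle.bin last.txt
size=$(wc -c < tape.img)
dd if=/dev/urandom of=tape.img bs=1 seek=$((size * 3 / 4)) count=512 conv=notrunc 2>/dev/null

# Step 1: image the tape, ignoring read errors (dd conv=noerror, as in
# the story; a plain file has no read errors, so this is illustrative).
dd if=tape.img of=backup.tar.gz conv=noerror 2>/dev/null

# Step 2: decompress as far as gzip gets, keeping the partial output.
gzip -dc backup.tar.gz > backup.tar 2>/dev/null || true

# Step 3: extract every complete file that precedes the damage.
mkdir recovered
tar -xf backup.tar -C recovered 2>/dev/null || true
ls recovered   # first.txt is recovered; anything after the damage is not
```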

      • get rid of the -z flag to tar in the cron job, NOW
      • get new tapes, preferably longer tapes
      • discard the old, worn-out tapes
      • get a cleaning tape
      • clean up both streamers
      • verify the archive on tape after backup
      • preferably, get another junk PC, connect the second streamer to that PC, and use that PC to actually test data recovery
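      A minimal sketch of what the fixed cron job might have looked like. This is purely an assumption about the setup: the device name (/dev/nst0), the mt rewinds, and the paths are all invented, and a temp dir and file stand in for the data disk and the tape drive so the logic can be tried anywhere. The two key points are plain tar without -z, and a verify pass that actually re-reads the archive.

```shell
#!/bin/sh
# Sketch of the fixed backup job (GNU tar assumed). On the real box,
# ARCHIVE would be the tape device (e.g. /dev/nst0), with
# `mt -f /dev/nst0 rewind` before the write and again before the verify.
set -eu
DATA=$(mktemp -d)                      # stand-in for the data disk
echo "student project" > "$DATA/project.txt"
ARCHIVE="$DATA.tar"                    # stand-in for the tape drive

# Plain tar, no -z: one bad block on tape costs files, not the archive.
tar -cf "$ARCHIVE" -C "$DATA" .

# Verify: re-read the archive and compare it against the live tree.
if tar -df "$ARCHIVE" -C "$DATA"; then
    echo "backup verified"
else
    echo "backup verify FAILED" >&2    # on the real box: mail root here
    exit 1
fi
```

      GNU tar's -d (--compare) re-reads the whole archive and diffs it against the filesystem, which is the closest thing to a restore test you can do without a second machine; the last bullet above (a junk PC with the spare streamer) is the real thing.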

      All told, a lot of data was recovered: some from the tapes, some from student PCs in the lab, some from old disks in the junk bin. But a lot was lost.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^2: Have you ever lost your work? (disaster recovery)
by LanX (Saint) on Jan 09, 2024 at 22:40 UTC
    > serious disaster recovery testing

    Careful! "Serious" testing sometimes leads to serious problems.

    I don't remember the details anymore, but I heard about a test involving "quick" reboots of some key infrastructure in a data center.

    Nobody expected that rebooting too many servers at the same time would overwhelm the peak electrical supply, which in turn led to shutdowns.

    Bottom line: Better test the testing!

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery