in reply to Archiving data in a database
A lot will depend upon which type of database you are using (Berkeley type or RDBMS, etc.) and which particular one of whichever type.
With a non-relational DB, there are two approaches:
The first is to take a periodic snapshot (copy) of the whole DB. You would need to do this during a "regular maintenance period", during which the DB is disabled from use. If your DB isn't very big, a few seconds or even minutes of downtime at 3:00 am local time (on Sundays?) is probably acceptable for most purposes.
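For example, if the DB is a single file accessed through a tied hash, the snapshot can be as simple as untying, copying the file, and retying. A minimal sketch, assuming DB_File; the path, archive name, and schedule are illustrative:

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;
use File::Copy qw( copy );
use POSIX qw( strftime );

my $dbfile = 'data.db';    # illustrative path

tie my %db, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $dbfile: $!";

# ... normal operation ...

# During the maintenance window: untie (which flushes and closes the
# file), copy it to the archive, then retie and carry on.
untie %db;

my $stamp = strftime '%Y%m%d-%H%M%S', localtime;
copy $dbfile, "$dbfile.$stamp"
    or die "Snapshot copy failed: $!";

tie %db, 'DB_File', $dbfile, O_RDWR, 0644, $DB_HASH
    or die "Cannot retie $dbfile: $!";
```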
The second is to log the changes as they are made. This isn't too hard to do if you are using a tied hash interface to this type of DB. You only need to write a fairly simple tied hash proxy module that intercepts and forwards each access to the DB, and logs those operations (STORE/DELETE etc.) that modify it.
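A minimal sketch of such a proxy; the class name and log format are illustrative, and you would extend the forwarded methods (SCALAR, UNTIE, ...) to match whatever your underlying tie supports:

```perl
package Tie::LoggedHash;

use strict;
use warnings;
use IO::Handle;

sub TIEHASH {
    my( $class, $realhash, $logfile ) = @_;
    open my $log, '>>', $logfile or die "Cannot open $logfile: $!";
    $log->autoflush( 1 );
    return bless { db => $realhash, log => $log }, $class;
}

## Read-only operations are forwarded untouched.
sub FETCH    { my( $self, $key ) = @_; return $self->{db}{ $key } }
sub EXISTS   { my( $self, $key ) = @_; return exists $self->{db}{ $key } }
sub FIRSTKEY { my $self = shift; my $a = scalar keys %{ $self->{db} }; return each %{ $self->{db} } }
sub NEXTKEY  { my $self = shift; return each %{ $self->{db} } }

## Modifying operations are forwarded *and* logged with a timestamp.
sub STORE {
    my( $self, $key, $value ) = @_;
    printf { $self->{log} } "%d STORE %s %s\n", time, $key, $value;
    return $self->{db}{ $key } = $value;
}

sub DELETE {
    my( $self, $key ) = @_;
    printf { $self->{log} } "%d DELETE %s\n", time, $key;
    return delete $self->{db}{ $key };
}

sub CLEAR {
    my $self = shift;
    printf { $self->{log} } "%d CLEAR\n", time;
    %{ $self->{db} } = ();
}

1;

## Usage: interpose the proxy between your code and the real tied hash.
# tie my %real, 'DB_File', 'data.db', O_RDWR, 0644, $DB_HASH;
# tie my %db,   'Tie::LoggedHash', \%real, 'changes.log';
# $db{fred} = 'bloggs';    # forwarded to %real and appended to the log
```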
With an RDBMS, most will have facilities for logging record changes built in. It is just a case of reading the docs and learning how to enable them (if they are not enabled by default), and then how to examine them and, if necessary, recover information from them.
For example: MySQL has the --log-update[=file_name] (now deprecated) and the --log-bin[=file_name] (preferred) options that you can add to the mysqld startup command line. See the docs for more info.
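For instance, something along these lines; the log base name is illustrative, and mysqlbinlog is MySQL's standard utility for examining the binary log:

```
# In my.cnf (or as --log-bin on the mysqld command line):
[mysqld]
log-bin=mysql-bin

# Later, to inspect the logged changes:
mysqlbinlog mysql-bin.000001 | less
```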
The problem then becomes one of deciding:

How long to retain the logs. This is mostly a basic question of "How much disc space am I willing to use for this?". Though in some circumstances (legal & financial records etc.) the law may dictate how long you must keep such information.

Where to archive them. This is much simpler these days: given the low cost of optical storage media (CDs and DVDs), it is quite easy and very cheap to simply offload your backups on a regular basis and keep them pretty much forever.

How to use them when the time comes. This is one of those questions that you will only be able to answer once you actually have a need for it.
In theory, if you have had logging enabled from day one of using a new database, and you have kept all the logs, it is possible to "replay" the logs from the beginning, in order, against a restored copy of the bare schema and retrace history to any point.
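Using MySQL's binary logs as the example again, the replay itself is just a pipe (the file names and cut-off date here are illustrative):

```
# Restore the bare schema or the last full backup first:
mysql -u root -p mydb < full_backup.sql

# Then replay the binary logs, in order:
mysqlbinlog mysql-bin.000001 mysql-bin.000002 | mysql -u root -p

# --stop-datetime halts the replay at any chosen point in history:
mysqlbinlog --stop-datetime="2005-04-03 03:00:00" mysql-bin.000003 | mysql -u root -p
```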
In practice, the longer the database has been in use, the longer it will take to get back to a point in the recent past, and the more likely it is that a log has been lost or corrupted, leaving you dead in the water.
To avoid this, it is best to take regular (incremental) backups of the database itself and archive these along with the logs. It can save a great deal of time when you need to retrace things to the state of a few days ago.
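With MySQL, for example, mysqldump can rotate the binary log at the instant the dump is taken, so each backup marks a clean boundary from which to start a replay. The options are standard mysqldump flags; the names are illustrative:

```
# A consistent dump that also closes the current binary log and opens a new one:
mysqldump --single-transaction --flush-logs mydb > mydb-backup.sql

# Recovery later: restore the newest dump, then replay only the logs written since.
```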
Finally, the two most important things about backups (DB or otherwise) are simplicity and verification.
E.g. it's no good having a really sophisticated backup naming scheme if, a year from now, when the guy that invented it has moved on and you really need to recover the transactions from a month ago last Thursday, whoever is charged with the task cannot work out how to do it.
I won't recount the whole story, but I was working for a large UK retailer that was spending large sums of money on tape rotation and off-site storage of multi-generational backup tapes for each of its 300+ stores. The day came that disaster struck, and it was necessary to replace a store's hardware and recover the DB from tape. The tapes were blank. All of them. A single minuscule difference meant that whilst the rigorously tested system worked just fine in the test store environment, it wrote completely useless tapes in the production environment. A sum of £5 million wasted over 18 months was mentioned.
Tests, test harnesses and test environments are fine things to have, but cross verification in a live environment is paramount.