Re: Archiving data in a database
by BrowserUk (Patriarch) on Oct 04, 2004 at 05:57 UTC
A lot will depend upon which type of database (Berkeley type or RDBMS etc.) and which particular one of whichever type you are using.
With a non-relational DB, there are two approaches:
- Back up the entire database (optionally with compression) using one of the standard zipping or archiving utilities.
You would need to do this during a "regular maintenance period" during which the DB is disabled from use. If your DB isn't very big, a few seconds or even minutes at 3.00 am local time (on Sundays?) is probably acceptable for most purposes.
- Alternatively, if you need to keep a transactional record of all the changes, then you would (probably) need to roll your own logging around the interface to the DB.
This isn't too hard to do if you are using a tied hash interface to this type of DB. You only need to write a fairly simple tied-hash proxy module that intercepts and forwards each access to the DB and logs those operations (STORE/DELETE etc.) that modify it.
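By way of illustration, here is a minimal sketch of such a proxy. The package name DBLogProxy, the log format and the file handling are assumptions for this sketch, not an existing module; the real DB sits behind any ordinary tied hash (DB_File, SDBM_File, etc.):

package DBLogProxy;    # hypothetical name, for this sketch only
use strict;
use warnings;

# Tie over an already-tied hash (e.g. a DB_File hash) and append every
# modifying operation to a plain-text log file.
sub TIEHASH {
    my ($class, $inner, $logfile) = @_;
    open my $log, '>>', $logfile or die "Cannot open $logfile: $!";
    return bless { db => $inner, log => $log }, $class;
}

# Read-only operations pass straight through to the underlying hash.
sub FETCH    { $_[0]->{db}{ $_[1] } }
sub EXISTS   { exists $_[0]->{db}{ $_[1] } }
sub FIRSTKEY { my $h = $_[0]->{db}; keys %$h; each %$h }
sub NEXTKEY  { each %{ $_[0]->{db} } }

# Modifying operations are logged first, then forwarded.
sub STORE {
    my ($self, $key, $value) = @_;
    print { $self->{log} } join("\t", time, 'STORE', $key, $value), "\n";
    $self->{db}{$key} = $value;
}

sub DELETE {
    my ($self, $key) = @_;
    print { $self->{log} } join("\t", time, 'DELETE', $key), "\n";
    return delete $self->{db}{$key};
}

1;

# Usage: tie the real DB first, then tie the proxy over it.
#   use DB_File;
#   tie my %real, 'DB_File', 'data.db';
#   tie my %db,   'DBLogProxy', \%real, 'changes.log';
#   $db{answer} = 42;      # logged, then stored
#   delete $db{answer};    # logged, then deleted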
With an RDBMS, most of these will have facilities for logging record changes built in. It is just a case of reading the docs and learning how to enable them (if they are not enabled by default), and then how to examine and, if necessary, recover information from them.
For example: MySQL has the --log-update[=file_name] (now deprecated) and the --log-bin[=file_name] (preferred) options that you can add to the mysqld startup command line. See the docs for more info.
The problem then becomes one of deciding:
- how much (how long) you need to keep the log data.
This is mostly a basic question of "How much disc space am I willing to use for this?". Though, in some circumstances--legal & financial records etc.--the law may dictate how long you must keep such information.
This is much simpler these days: given the low cost of optical storage media (CDs and DVDs), it is quite easy and very cheap to simply offload your backups on a regular basis and keep them pretty much forever.
- what actions you envisage using the archived data for?
This is one of those questions that you will only be able to answer once you actually have a need for it.
In theory, if you always have the logging enabled from day one of using a new database, and you keep all the logs, it is possible to "replay" the logs from the beginning, in order, on a restored copy of the bare schema and retrace history back to any point.
In practice, the longer the database is in use, the longer it will take to get back to a point in the recent past, and the more likely that a log has been lost or corrupted leaving you dead in the water.
To avoid this, it is best to take regular (incremental) backups of the database itself and archive these as well as the logs. It can save a great deal of time when you need to retrace things to the state of a few days ago. A rough sketch of restoring a dump and replaying the logs follows below.
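For MySQL in particular, the restore-and-replay might look roughly like this. It is only a sketch under assumptions: the dump and binlog file names, the database name and the stop time are placeholders, you would normally pass the usual credential options to mysql and mysqlbinlog, and it assumes a mysqlbinlog recent enough to support --stop-datetime:

use strict;
use warnings;

# Placeholder names and paths for this sketch.
my $db        = 'mydb';
my $dump      = 'backups/full_dump.sql';        # last full backup before the target point
my @binlogs   = sort glob 'backups/binlog.*';   # archived binary logs, in order
my $stop_time = '2004-09-30 03:00:00';          # replay changes up to this moment

# 1. Restore the full dump onto an empty database.
system("mysql $db < $dump") == 0
    or die "restore failed: $?";

# 2. Replay the archived binary logs, in order, stopping at the chosen time.
my $logs = join ' ', @binlogs;
system(qq{mysqlbinlog --stop-datetime="$stop_time" $logs | mysql $db}) == 0
    or die "replay failed: $?";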
Finally, the two most important things about backups (DB or otherwise) are:
- That the mechanisms for doing them are in place, simple (read: automated) and easily understood by the new guy a year from now.
E.g. it's no good having a really sophisticated backup naming scheme if, a year from now when the guy that invented it has moved on and you really need to recover the transactions from a month ago last Thursday, whoever is charged with the task cannot work out how to do it.
- The consistency (and usability) of the backups should be tested on a regular basis. And that means testing the recovery of a real piece of data, from a real, live production system--not the test setup!
I won't recount the whole story, but I was working for a large UK retailer that was spending large sums of money on tape rotation and off-site storage of multi-generational backup tapes for each of its 300+ stores. The day came that disaster struck and it was necessary to replace a store's hardware and recover the DB from tape, and the tapes were blank. All of them. A simple, minuscule difference meant that whilst the rigorously tested system worked just fine in the test store environment, it wrote completely useless tapes in the production environment. A sum of £5 million wasted over 18 months was mentioned.
Tests, test harnesses and test environments are fine things to have, but cross verification in a live environment is paramount.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Archiving data in a database
by cleverett (Friar) on Oct 04, 2004 at 05:53 UTC
The simplest way I can think of:
- use 2 tables (assuming mysql here, adjust as necessary):
create table d_table (
    d_table_id integer unsigned auto_increment primary key,
    table_name varchar(32) not null,
    d_happened timestamp not null,
    operation enum('insert','update','delete') not null,
    index (table_name, d_happened)
);
create table d_field (
    d_field_id integer unsigned auto_increment primary key,
    d_table_id integer unsigned not null,
    field_name varchar(32) not null,
    old_value text not null,
    unique index (d_table_id, field_name)
);
- For table inserts, simply note the insert:
$dbh->do("INSERT INTO d_table SET table_name = ?, operation = 'insert'",
    undef, $table_name);
- For table updates and deletes, run something like:
my $old_data = $dbh->selectrow_hashref(
    "select * from table_to_change where primary_key_id = ?",
    undef, $table_row_id);
## update the table
$dbh->do("INSERT INTO d_table SET table_name = ?, operation = 'update'",
    undef, $table_name);
my $d_table_id = $dbh->selectrow_array("select last_insert_id()");
foreach my $f (@updated_field_names) {
    $dbh->do("INSERT INTO d_field (d_table_id, field_name, old_value) VALUES (?, ?, ?)",
        undef, $d_table_id, $f, $old_data->{$f});
}
I leave the changes needed for a delete as an exercise to the dear reader :)
Using this general scheme you should find it fairly easy to recapture the state of your database at any time during its existence.
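As a rough illustration of that claim, the sketch below rolls a single row back to how it looked at an earlier time by walking the audit trail in reverse. It leans on two assumptions that are not in the schema above: a hypothetical row_id column on d_table identifying the affected row (see the first note below about tracking row numbers), and deletes recording every field's old value in d_field:

use strict;
use warnings;

# Return a hashref of field => value for one row of $table as it stood at
# $when, or undef if the row did not exist at that time.
sub row_as_of {
    my ($dbh, $table, $row_id, $when) = @_;

    # Start from the row as it is now (undef if it has since been deleted).
    my $row = $dbh->selectrow_hashref(
        "SELECT * FROM $table WHERE primary_key_id = ?", undef, $row_id);

    # Fetch every audit record newer than $when, newest first.
    # NB: t.row_id is a hypothetical extension to d_table (see the notes).
    my $changes = $dbh->selectall_arrayref(q{
        SELECT t.operation, f.field_name, f.old_value
          FROM d_table t
     LEFT JOIN d_field f USING (d_table_id)
         WHERE t.table_name = ? AND t.row_id = ? AND t.d_happened > ?
      ORDER BY t.d_happened DESC, t.d_table_id DESC
    }, { Slice => {} }, $table, $row_id, $when);

    # Undo each change in turn, most recent first.
    for my $c (@$changes) {
        if ($c->{operation} eq 'insert') {
            $row = undef;         # before the insert, the row did not exist
        }
        else {                    # 'update' or 'delete'
            $row ||= {};          # a delete means the row existed beforehand
            $row->{ $c->{field_name} } = $c->{old_value}
                if defined $c->{field_name};
        }
    }
    return $row;
}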
Notes:
- Though I believe in the general soundness of the above, a bunch of situations exist that could thoroughly ream you, unless you extend the model to cover them:
- a mysql "ALTER TABLE foo AUTO_INCREMENT = 0", which will reset your row numbers (you can handle this by tracking row numbers for your operations though).
- Global operations will cost you bigtime, but again you can extend this model to turn them into single operations by using an enum field as a flag.
- In your shoes, I would be using $dbh->begin_work(), $dbh->commit() and $dbh->rollback() from the Perl DBI library to keep the d_table and d_field tables synced with the rest of your database (see the sketch after these notes).
- If you have to do extensive coding with this discipline, create a subclass of the Class::DBI library to automate the changes to the d_table and d_field tables, instead of coding everything by hand.
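For instance, here is a minimal sketch of one audited update wrapped in a transaction. The connection details, table name, key and column values are placeholders for the example, and the storage engine must support transactions (e.g. InnoDB):

use strict;
use warnings;
use DBI;

# Placeholder connection and data for this sketch.
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });
my $table_name   = 'some_table';
my $table_row_id = 42;
my %new_values   = ( status => 'closed' );

$dbh->begin_work;
eval {
    my $old_data = $dbh->selectrow_hashref(
        "SELECT * FROM $table_name WHERE primary_key_id = ?",
        undef, $table_row_id);

    # The real change.
    my $set = join ', ', map { "$_ = ?" } keys %new_values;
    $dbh->do("UPDATE $table_name SET $set WHERE primary_key_id = ?",
        undef, values %new_values, $table_row_id);

    # The audit records, inside the same transaction.
    $dbh->do("INSERT INTO d_table SET table_name = ?, operation = 'update'",
        undef, $table_name);
    my $d_table_id = $dbh->selectrow_array("SELECT last_insert_id()");
    for my $f (keys %new_values) {
        $dbh->do("INSERT INTO d_field (d_table_id, field_name, old_value) VALUES (?, ?, ?)",
            undef, $d_table_id, $f, $old_data->{$f});
    }

    $dbh->commit;
};
if ($@) {
    $dbh->rollback;    # either everything is recorded, or nothing is
    die "audited update failed: $@";
}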
Re: Archiving data in a database
by pg (Canon) on Oct 04, 2004 at 04:53 UTC
I use two approaches at the same time. (Whether this fits you really depends on your purpose.)
- I have a data warehouse, so "facts" are extracted from the operational database periodically and stored in the data warehouse. This is one type of history.
- On the other hand, I also want to know how those "facts" change in my operational database. I usually just design some event tables, and record what changes (including insert/change/delete) have been made, when, and by which user. How much detail? Enough that I can figure out what has happened when a user questions the correctness of the data.
Re: Archiving data in a database
by astroboy (Chaplain) on Oct 04, 2004 at 05:03 UTC
I've done something for this in an Oracle database trigger, but the same approach can be taken in Perl if you want to be database agnostic. I set up an audit table with columns like table_name, primary_key (assumes a single-column key, but could be modified for compound keys), value_before, value_after, user_name and date_changed. Every time a DML statement is performed, a row is inserted into the audit table for each column affected by that statement. Naturally, inserts will have no before values, and deletes will have no after values.
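Done from the Perl side rather than in a trigger, the bookkeeping might look roughly like the following sketch. The audit table name (audit_log), the extra column_name column identifying which column each before/after pair refers to, and the surrounding variables are all assumptions modelled on the description above, not the actual implementation:

use strict;
use warnings;
use POSIX qw(strftime);

# Record one row in the (assumed) audit_log table per column affected by a
# DML statement.  %$before / %$after hold the row's values before and after
# the change; inserts have no before values, deletes no after values.
sub audit_change {
    my ($dbh, $table, $pk_value, $user, $before, $after) = @_;

    my %columns = map { $_ => 1 } (keys %$before, keys %$after);
    my $now     = strftime('%Y-%m-%d %H:%M:%S', localtime);

    my $sth = $dbh->prepare(q{
        INSERT INTO audit_log
            (table_name, primary_key, column_name,
             value_before, value_after, user_name, date_changed)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    });

    for my $col (sort keys %columns) {
        my ($old, $new) = ($before->{$col}, $after->{$col});
        next if defined $old && defined $new && $old eq $new;   # column unchanged
        $sth->execute($table, $pk_value, $col, $old, $new, $user, $now);
    }
}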
Better to do it with a trigger, so you only code it once. Otherwise the code will be repeated everywhere in your accessing programs, which is much more difficult to maintain. Even if you capture the logic in a library, a programmer might still forget to call it when he should.
Also, a trigger is usually faster.
The only problem is that if you are using MySQL, you won't have triggers until version 5.
As a side note, I recently had to change an application over to using MySQL instead of Sybase, and I certainly wished the original developer had not bothered with triggers. :-)
Re: Archiving data in a database
by TedPride (Priest) on Oct 04, 2004 at 05:05 UTC
pg is correct++. The best way to do this is to make a backup copy of the database every so often (usually monthly), and record changes since the last backup in a secondary event file, with timestamp and user (and/or IP) fields to show who made the changes. If anyone questions your data, you just go back to the last log showing (according to the person questioning) accurate data, and print out the events for that record or records.