morgon has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I hope I am not abusing this forum too much, as my problem is not strictly a Perl problem, but I hope for some feedback anyway.

We are currently designing a system where we will persist Moose-based objects in an Oracle database (probably using DBIx::Class), but we want not only to store the current state of the objects but also to keep a history of their previous states.

The question now is how to do this.

One of the approaches to such problems I have seen consists of adding a "version" column to all the tables holding versioned data, which gets incremented with every new version, the current version then being the one with the highest version value. Current data can then be made available via a view that suppresses all the historic rows.
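
To make the idea concrete, here is a minimal sketch of what I have in mind (table and column names are invented for illustration; this is plain SQL rather than anything Oracle-specific):

```sql
-- every change inserts a new row with an incremented version
create table customer (
    id      integer not null,   -- version-independent identifier
    version integer not null,   -- incremented with every change
    name    varchar(100),
    primary key (id, version)
);

-- view that suppresses all the historic rows
create view customer_current as
select c.*
  from customer c
 where c.version = (select max(version)
                      from customer
                     where id = c.id);
```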

Is that the way to do it or are there better ways?

Assuming this is the way to go, I can see several options for implementing it, e.g.

- implement it yourself in Perl

- do it with triggers in PL/SQL

- use Oracle Workspace Manager

What would be the different pros and cons?

I would like to avoid being locked into Oracle technology, but as I do not realistically expect this application ever to be migrated to something else, I could accept a lock-in to proprietary technology if it makes my life easier...

In particular, I have never used the Workspace Manager, but it could be exactly what we need - is there anyone who could share their experiences?

Many thanks!

Re: database historisation
by ELISHEVA (Prior) on Apr 16, 2009 at 22:40 UTC

    History can get complicated very quickly. You haven't said much about how you plan to use the history.

    From the database design perspective there are two key questions:

    • is the history part of the current view, or is it being kept as an audit trail?
    • how is the object for which you want to keep history used in foreign keys? Are there database tables that always need to be linked to the current version? To a particular historical version? Some combination of both?

    Examples of history being part of the current view include a history of bank transactions, customer service incidents, or inventory levels on a given day (I presume this was the kind of data that ig used timestamps for?). These kinds of records are often identified with time stamps or with unintelligent ids that are generated whenever a new incident happens at a new point in time.

    However, there is a lot of data where the relationship with time is a lot more mixed. Consider a child: date of birth, birth weight, genetic parents, social security number, and so forth are fixed and should stay the same throughout the life of the child. Height and weight, hair color, physical health, current residence are all bound to change over time. There will be times when one wants to link to the current version of the child, and there will be times when one will want to link to a specific version of the child.

    Identifying the child via timestamp would *not* be a good idea. Every time the child grew, we would have to assign the child a new id and then ripple that change out to every record that linked to the child. On the other hand, we may still need a way to identify a particular version of the child. A medical record at age 8 won't make much sense if every relational join connects only to the current version, which happens to be a full grown adult.

    Finally, there are databases that keep history for auditing purposes only. For example, records of checks cashed long ago should not change. However, there are rare occasions when one needs to fix typos. Some systems let supervisors review and edit batch accounting transactions before they have been posted. For security reasons, all changes are audited. When the supervisor makes a change, he or she sees only the current record. The snapshots are normally stored in a separate table that is only looked at when checking for fraud or other questionable activity.

    In scenarios like the child and the fraud audit, records normally have two identifiers: a time/version independent identifier and a version identifier. The version independent identifier is used whenever a record needs to link to the current view. The version identifier is used whenever a record needs to link to a particular historical version.

    In such scenarios, two tables are used for each object. The main table stores only the current records. Its primary key is the version independent identifier. It stores the version identifier as an attribute only. The history is kept in a separate table. In the history table, the version id is, of course, the primary key.

    The main table is associated with a trigger that checks for "significant" changes (any change that needs to be audited). Before these changes are committed, and before a record is finally deleted, a snapshot of the record is taken and placed in the history table. A union view can be used whenever the history and the current version need to be seen together in reports.
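
    As a rough sketch of this two-table layout, under invented names (the trigger itself is vendor-specific, so only the tables and the union view are shown here):

    ```sql
    -- main table: current records only; the version-independent
    -- id is the primary key, the version id is just an attribute
    create table child (
        child_id   integer primary key,
        version_id integer not null,
        name       varchar(100),
        weight_kg  numeric(5,2)
    );

    -- history table: one row per snapshot; the version id is the key
    create table child_hist (
        version_id integer primary key,
        child_id   integer not null,
        name       varchar(100),
        weight_kg  numeric(5,2),
        changed_at timestamp not null
    );

    -- union view for reports that need history and current state together
    create view child_all_versions as
    select child_id, version_id, name, weight_kg from child
    union all
    select child_id, version_id, name, weight_kg from child_hist;
    ```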

    The two-table approach has four advantages over a single all-versions table: (a) database tools for managing primary keys can be used to ensure that the version-invariant id is unique. (b) It is optimized for queries that need the current view - one doesn't need to sort through reams of history to display current records. (c) The chance of a broken transaction is lower - modifications have only local impact (the main table and the snapshot table). There is no need to worry about cascading updates across the database. Records that point to the current version will continue pointing to the current version. (d) Deletes are easier to manage.

    A record is considered deleted if it is missing from the main table. The database tools for handling delete integrity will also continue to work properly, as they would in a non-historical database. If all records are in the same table, the only way to tell whether the most recent record is the history of a deleted record or the current version of an undeleted record is to add delete flags all over the place. Special triggers would have to be written to fake the delete integrity that comes naturally in a non-historical database.

    As for implementing this in Moose or anything else outside the database: not a good idea. In all of the types of history discussed above, history and data integrity are closely intertwined. Anything affecting data integrity should stay inside the database, via triggers and stored procedures. This is the only way to ensure that applications that view and modify the data always play by the same rules.

    Best, beth

Re: database historisation
by ig (Vicar) on Apr 16, 2009 at 21:23 UTC

    For a service management/inventory/provisioning application I once wrote, I used timestamps for version numbers on all principal records. I also kept a log of committer, comments and an automated overview expression of the change.

    As you suggest, I had views that presented current state and others that presented the full history. It was simple (crude even) but surprisingly effective - so much so that several attempts to replace it over the years were aborted when it became clear the replacements weren't as helpful as the original. It was all Perl/PostgreSQL - no triggers or stored procedures. A real developer (I'm just a sysadmin) who supported it after I left said it was very easy to understand, but that I was obviously not a developer because I hadn't used any of the tools that would have made it easier. But I ramble...

    I have often wished for the structure of a relational database with the version/update history of repositories like cvs and git. I wish I could add a time or version to my queries to pull out the way it was. The concept is simple but I have never seen anything like it or made it myself. But then I haven't put much effort into finding such a system.
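
    That said, with timestamped versions an "as of" query can at least be approximated in plain SQL - something along these lines (names invented for illustration; :as_of is a bind parameter):

    ```sql
    -- the state of each record as it was at :as_of
    select s.*
      from service s
     where s.valid_from = (select max(valid_from)
                             from service
                            where id = s.id
                              and valid_from <= :as_of);
    ```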

      I have often wished for the structure of a relational database with the version/update history of repositories like cvs and git.

      Take a look at SciDB. I'm a little bit involved in this initiative. You can view it as the DB of the future;) One of its foreseen features will be a versioning mechanism!

      Cheers
      dHarry

Re: database historisation
by planetscape (Chancellor) on Apr 17, 2009 at 06:09 UTC

    While it deals specifically with MS Access databases and VBA, the article Creating an Audit Log, by Allen Browne, gives a very good overview of the general steps one needs to take to create a database audit trail, and mentions factors you should consider along the way. Perhaps the article will give you some ideas you can use, or at least suggest ways to refine your search for Perl equivalents.

    HTH,

    planetscape
Re: database historisation
by roboticus (Chancellor) on Apr 17, 2009 at 13:36 UTC
    morgon:

    Your method is fine, but when I need this capability, I find it best to split the history into a separate table. The way I prefer is to replicate the table structure and add a timestamp column to it (along with other information you might want to capture).

    I do it this way because I tend to think of the primary table as "the current state of things", and in my applications, I find that the current state is more often queried than the history. If your applications deal with history as often as current state, then splitting the data into a separate table may not be a good idea. I prefer the split, though, because it's far too easy to have queries give misleading results when you have to remember to explicitly filter out history. In the cases where I *do* want the history, I'll join the tables.

    I do the versioning with triggers so the database can't get itself into an invalid state. (I also apply as many constraints as I can to the database to keep it internally consistent. It's much easier to modify the database structure to accommodate new requests than it is to remove crap from a database once it gets in there.) To prevent vendor lock-in, keep your triggers trivial, so they can be translated simply. (In fact, if you keep the structure the same, you can automate the translation.) So you might have something like the following (I'm using MS SQL instead of Oracle, as I haven't used Oracle since 42 BC):

    create table foobar (
        foo int primary key,
        bar varchar(5)
    )

    create table foobar_hist (
        foo int,
        bar varchar(5),
        change  char(1),   -- Type of change
        changed datetime,  -- when change occurred
        constraint PK_foobar_hist primary key (foo, changed)
    )

    /* MS-SQL triggers provide two virtual tables:
     *   INSERTED - holds records to add to the table, and
     *   DELETED  - holds records to delete from the table
     * An update is treated as an insert and a delete at the
     * same time, so you have access to the before *and*
     * after values of the columns.
     */
    create trigger foobar_ins on foobar for insert as
        insert foobar_hist (foo, bar, change, changed)
        select foo, bar, 'I', getdate() from INSERTED

    create trigger foobar_upd on foobar for update as
        insert foobar_hist (foo, bar, change, changed)
        select foo, bar, 'U', getdate() from INSERTED

    create trigger foobar_del on foobar for delete as
        -- Note: the final record is duplicated in the history
        -- table: the last I or U record and this D record have
        -- the same values other than the change/changed columns.
        insert foobar_hist (foo, bar, change, changed)
        select foo, bar, 'D', getdate() from DELETED

    As you can see, I try not to put any rocket science in those triggers.

    Advantages are:

    1. Your primary table and indexes stay as small as possible
    2. Your queries for current information remain simple

    Disadvantages are:

    1. More database objects to maintain

    I hope you find this useful...

    ...roboticus
      Thanks, what you propose is almost exactly what I have been experimenting with earlier today.

      Tom Kyte (an Oracle guru), however, discourages the use of triggers here in favor of a specific Oracle technology, but as I have no experience with Oracle Workspace Manager I'll take your word that such a trigger approach is tried and tested :-)

        morgon:

        That flashback data archive technology certainly looks like it would do the trick. I probably need to update my Oracle knowledge, and then look over the new features of SQL Server 2008 and see if there are similar methods available there.
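
        If I'm reading the documentation correctly, the query side of it looks roughly like this (untested - check the Oracle docs before relying on the details):

        ```sql
        -- the table as it was one hour ago (Oracle Flashback Query)
        select * from foobar
            as of timestamp (systimestamp - interval '1' hour);

        -- all row versions in a time window, with change metadata
        select versions_starttime, versions_operation, foo, bar
          from foobar
            versions between timestamp (systimestamp - interval '1' day)
                         and systimestamp;
        ```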

        ...roboticus
Re: database historisation
by dHarry (Abbot) on Apr 17, 2009 at 06:42 UTC

    I would like to avoid being locked into Oracle technology

    It's safe to say you can forget about that, unless you don't use Oracle:)

    More seriously: avoid using any PL/SQL extensions; instead, use only plain/standard SQL. Don't use stored procedures - they're in no way standardized.

    Keeping track of history can be notoriously difficult. Give it lots of thought before you implement something!

Re: database historisation
by bduggan (Pilgrim) on Apr 17, 2009 at 20:26 UTC
    I wrote Class::DBI::Audit for this purpose. It worked well for the case where all database changes went through cdbi objects.