History can get complicated very quickly. You haven't said much about how you plan to use the history.

From the database design perspective there are two key questions:

Examples of history being part of the current view include a history of bank transactions, customer service incidents, or inventory levels on a given day (I presume this was the kind of data that ig used timestamps for?). These kinds of records are often identified with time stamps or with unintelligent ids that are generated whenever a new incident happens at a new point in time.

However, there is a lot of data where the relationship with time is a lot more mixed. Consider a child: date of birth, birth weight, genetic parents, social security number, and so forth are fixed and should stay the same throughout the life of the child. Height and weight, hair color, physical health, current residence are all bound to change over time. There will be times when one wants to link to the current version of the child, and there will be times when one will want to link to a specific version of the child.

Identifying the child via timestamp would *not* be a good idea. Everytime the child grew we would have to assign the child a new id and then ripple that out to every record that linked to the child. On the other hand, we may still need a way to identify a particular version of the child. A medical record at age 8 won't make much sense if every relational join connects only to the current version, which happens to be a full grown adult.

Finally, there are databases that keep history for auditing purposes only. For example, records storing data about long ago cashed checks should not change. However, there are rare occasions where one needs to fix typos. Some systems let supervisors review and edit batch accounting transactions before they have been posted. For security reasons, all changes are audited. When the supervisor makes a change, he or she only sees the current record. The snapshots are normally stored in a separate table that is only looked at when checking for fraud or other questionable activity.

In scenarios like the child and the fraud audit, records normally have two identifiers: a time/version independent identifier and a version identifier. The version independent identifier is used whenever a record needs to link to the current view. The version identifier is used whenever a record needs to link to a particular historical version.

In such scenarios, two tables are used for each object. The main table stores only the current records. Its primary key is the version independent identifier. It stores the version identifier as an attribute only. The history is kept in a separate table. In the history table, the version id is, of course, the primary key.

The main table is associated with a trigger that checks for "significant" changes (any change that needs to be audited). Before committing these change and before a final delete of a record, a snapshot of the record is taken and placed in the history table. A union view can be used whenever the history and the current version need to be seen together in reports.

The two table approach has four advantages over a single all versions table: (a) database tools for managing primary keys can be used to insure that the version invariant id is unique. (b) it is optimized for queries that need the current view - one doesn't need to sort through reams of history to display current records. (c) the chance of a broken transaction is lower - modifications have only local impact (the main table and the snapshot table). There is no need to worry about cascading updates across a database. Records that point to the current version will continue pointing to the current version. (d) deletes are easier to manage.

A record is considered deleted if it is missing from the main table. The database tools for handling delete integrity will also continue to work properly, as they would in a non-historical database. If all records are in the same table, the only way to tell if the most recent record is the history of deleted record or the current version of an undeleted record is to add delete flags all over the place. Special triggers would have to be written to fake the delete integrity that comes naturally in a non-historical database.

As for implementing this in Moose or anything outside the database: not a good idea. In all of the types of history discussed above history and data integrity are closely mixed. Anything affecting data integrity should stay inside the database via triggers and stored procedures. This is the only way to insure that applications that view and modify the data always play by the same rules.

Best, beth


In reply to Re: database historisation by ELISHEVA
in thread database historisation by morgon

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.