Your Mother has asked for the wisdom of the Perl Monks concerning the following question:

I am certain this problem has been solved by those with better engineering than mine. Please save me from endless experiments with Algorithm::Diff::Apply and Re: Yet another Algorithm::Diff question that would be scrapped when I finally learned the right thing which you'll tell me now. (Update: with an immediate (--) I should clarify, I have spent quite a few hours working on this on and off; I'm not lazy, I just know I'm on the wrong track.)

How should one track revisions in a relational db? I've seen rcs, svn, and cvs for file based stuff like Kwiki, but there must be a text/string based way (can't save A::D::diff output without serializing it) without files and without asking the shell for help.

  • Comment on Revision control in relational databases

Replies are listed 'Best First'.
Re: Revision control in relational databases
by Corion (Patriarch) on Jul 20, 2004 at 07:20 UTC

    If you don't mind changing your database schema and access logic, the easiest way is to extend your database schema:

    create table sometext ( id INTEGER PRIMARY KEY, content VARCHAR(4096), previous_version INTEGER )

    Then, selecting the current version of something becomes easy (modulo indices on the previous_version column), updating a version becomes easy and building the diff between two versions is a matter of Algorithm::Diff.

    Of course, with MySQL, which has problems with accessing reverse indices, this might not be a good solution, and if you're driving a system like Perlmonks from the database, (read) access to the current version is far more often than access to previous versions. Then you might want to consider either partitioning your table so that the current versions all live in the same partition or splitting your table into two tables, sometext and old_sometext, where the current version always lives in sometext. With that approach you lose the ability to have multiple branches though, as you will only have one active version.

    If you want version/patch integrity across tables, you will first need atomic commits across tables and you will also need a global version number/commit number that will allow you to cross-reference a selection of "patches" against the database. Svk does that (for text files), and I think it would be good to steal the ideas from it.

    If you want magic rollback/versioning without changing your software, I believe there is nothing like that except cranking up your DBI logging, copying the database during the night, and then replaying the logs as you need to move forward. Moving backwards is most likely not possible unless you write an SQL parser that knows how to undo stuff or write your DB emulator, which saves the steps with undo information and thus allows you to "move backwards" in your DBI log.

    If you only want the differences between two versions of the database, your best bet is to dump both versions as text and then to run diff over the text files :-).

      With that approach you lose the ability to have multiple branches though, as you will only have one active version.

      Unless you store fulltext for BOTH the original and current version, with all intermediate versions being diffs up from the original. With a previous_version column you can have multiple branches (for example in case of simultaneous updates). To get an intermediate version, you start with the diff for that version, get all diffs from all previous_versions and then the base fulltext, and apply the diffs in order to the base fulltext.

      The latest version has a diff, a previous_version AND fulltext of the post. When there is an update, the fulltext is stripped off. The new (current) version can be diffed against that fulltext first, list the formerly current version as previous_version and store its own fulltext.

Re: Revision control in relational databases
by DaveH (Monk) on Jul 20, 2004 at 07:18 UTC

    Hi.

    One approach is to store a VERSION column for each piece of content which you store in the database, and just store the whole lot each time. So rather than taking a diff, you just store everything. Don't over-engineer if your problem is very simple.

    That said, this isn't very scalable. Obviously, storing the first revision once, then just subsequently storing context diffs against version 1 is much cheaper in terms of storage. Have a look at Text::Diff on CPAN for a simple interface to Algorithm::Diff for strings (text).

    Cheers,

    -- Dave :-)


    $q=[split+qr,,,q,~swmi,.$,],+s.$.Em~w^,,.,s,.,$&&$$q[pos],eg,print
Re: Revision control in relational databases
by EdwardG (Vicar) on Jul 20, 2004 at 08:31 UTC

    When I read the title of this node I assumed you wanted to control versions of source code stored in a database - in other words the stored procedures, views, triggers, etc etc, but at least one response seems to assume you want to version control the DATA stored in your database. A third interpretation could be that you want to version control the table structures, indexes etc (the schema). Perhaps you want all of this, and more?

    All of which is to say that your question is vague, and that answers will tend to be helpful only in direct proportion.

     

Re: Revision control in relational databases
by Joost (Canon) on Jul 20, 2004 at 10:04 UTC
    hmm... as noted above your question is vague, but considering this:
    there must be a text/string based way (can't save A::D::diff output without serializing it) without files and without asking the shell for help.
    I think you want to take a look at Text::Diff.

    By the way, revisions in relational databases might be harder than you think, considering the "relational" part of the database. Diff et al only fix the "easy" part of the problem.

Re: Revision control in relational databases
by Anonymous Monk on Jul 20, 2004 at 12:00 UTC
    Don't overlook the features of your database. For instance Oracle has flashback queries
Re: Revision control in relational databases
by tbone1 (Monsignor) on Jul 20, 2004 at 12:19 UTC
    Um, wouldn't a standard three-table audit scheme (with Change History, Work In Progress, and Official versions of each table) pretty much take care of this?

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

Vagueness and patched strings
by Your Mother (Archbishop) on Jul 20, 2004 at 14:32 UTC

    Yeah, I ended up being a little vague after writing a really long one trying not to cloud it. Erg.

    Doing multiple records/rows and then using the most recent did occur to me but I figured there might be a lower memory way or A Way It Is Done(tm). My conception of how I would do it if I could get it was:

    create table changes ( id char(10), # DB's other tables all have uniquely prefixed ids diff mediumtext, timestamp datestamp );

    And then stringify + concat all the new values of the cols, diff it against the old... It would need to do diffs and patches on strings. Text::Diff looks helpful for the first part at least. Patching (reverting) doesn't seem as obvious.

    Now that this problem is broken down against the simplicity of the problem (as DaveH and Corion mention) I think repeated records would be easier to implement less and prone to problems over all. Thanks!

      If you do ever decide to try and implement a diff system (I have thought about this problem a lot myself), remember to think about how your system will handle two simultaneous updates to a piece of content. There could be a race condition in the system you've outlined where, as you are doing the SELECTing and diffing required to generate the new change entry, a new row is entered in changes, for example from an update begun just before yours, for the same id. Now your diff is totally wrong.

      Also, I don't know about the latest versions of MySQL, but in my 3.x datestamp only is granular to the second level, and it is not inconceivable for a central piece of content on a busy site like a Wiki to get two changes entered within the same one-second window, so there would be no way to select which change to apply first because they both have the same timestamp. (Although this eventuality could be mitigated by designating PRIMARY KEY(id, timestamp) which should throw an error on simultaneous update)

        Good thoughts, and from others above. Thanks, as usual to all. I'm on MySQL 4.whatever but trying to keep things pretty generic just in case. The sad part is, the last place I worked did just what I want (within a Berkeley DB) and I wasn't interested in checking out the code at the time so I have no idea how it worked.

      There doesn't seem to be any mention of *which* RDBMS you're using. If you are using MS SQL Server I can suggest that you check out Red Gate's offering. They have tools that will compare both data and schema and there is also a programmer's toolkit for .NET if you care to depart from our beloved Perl :-)

      -Jim