pdt1630 has asked for the wisdom of the Perl Monks concerning the following question:

Curious to see if anyone has tackled a full DB table compare between different tables, to ensure that copies of a table are in sync with each other. There are a million tools to test structural integrity, but none that I've found address the data integrity issue. I have started the design process and am working out the bugs, but would love to hear if anyone has thoughts or code snippets on the problem. If I crack the nut I'll post here for everyone's use.

Re: Full DB Table Compare
by VSarkiss (Monsignor) on Sep 21, 2001 at 20:03 UTC

    I worked on something like that for a client. The whole system uses a publish-and-subscribe paradigm for getting data from other applications around the enterprise, so it needs to ensure data integrity. I didn't build it with DBI, which wasn't available at the client (non-technical reasons made it difficult to get it installed...), but the principles are the same.

    The program reads the database catalogs to get the structure of tables (column names, order, and data types) and to find one or more unique indices on the tables. If the tables you're comparing have identical structures, you can actually generate SQL statements that will make the data in the tables completely the same. (If there are some structure differences -- extra columns, for instance -- it gets a little trickier, but not too bad: it's easier if you have something to tell you "ignore these columns". In my case, they're stored in a controlling table.)
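    In DBI terms (not what was used here, but the principles are the same), the catalog reads might look roughly like this. The DSN, credentials, and table name are placeholders, and not every driver implements these catalog methods:

        use DBI;

        # Read the key columns and the column name/type list from the
        # catalog.  DSN, credentials, and table name are placeholders.
        my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                               { RaiseError => 1 });

        my @key_cols = $dbh->primary_key(undef, undef, 'T1');

        my $sth = $dbh->column_info(undef, undef, 'T1', '%');
        my (@cols, %types);
        while (my $c = $sth->fetchrow_hashref) {
            push @cols, $c->{COLUMN_NAME};
            $types{ $c->{COLUMN_NAME} } = $c->{TYPE_NAME};
        }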

    You also have to decide if you want to create a log or report of the data differences, or if you want the changes resolved "invisibly". In my case, the differences had to be displayed so they could be investigated.

    To make two tables with identical structures have completely the same data, three SQL statements will suffice. Say you have two tables, called T1 and T2, whose unique primary key columns are Ka, Kb, ..., and non-key columns are Va, Vb, .... These are the statements:

    -- 1. remove rows in T2 that are not in T1
    delete T2
    where not exists (
        select 1 from T1
        where T1.Ka = T2.Ka
          and T1.Kb = T2.Kb
          and ...
        -- The where clause can be generated by something like:
        -- join(' and ', map { "T1.$_ = T2.$_" } @key_col_names)
    )

    -- 2. synchronize rows that exist in both
    update T2
    set T2.Va = T1.Va,
        T2.Vb = T1.Vb,
        ...
    from T1, T2
    where T1.Ka = T2.Ka
      and T1.Kb = T2.Kb
      and ...
      and ((T1.Va != T2.Va) or (T1.Vb != T2.Vb) ...)
    -- Check that your platform supports a "from" clause
    -- in update statements.
    -- You can actually omit the "value" test; it just eliminates
    -- unnecessary data movement.  Beware that nullable columns
    -- need more complicated tests.

    -- 3. insert new rows
    insert T2 (Ka, Kb, ..., Va, Vb, ...)
    select Ka, Kb, ..., Va, Vb, ...
    from T1
    where not exists (
        select 1 from T2
        where T1.Ka = T2.Ka
          and T1.Kb = T2.Kb
          and ...
    )
    You can see how you might generate the statements in Perl if you had arrays of column names (combinations of join and map as in the example above).
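    For instance, here is a rough sketch of that generation. The table names are hard-coded, and @key_cols/@val_cols stand in for what you'd read from the catalog:

        # Generate the three statements from arrays of column names.
        # T1/T2 and the column lists are placeholders for catalog data.
        my @key_cols = qw(Ka Kb);
        my @val_cols = qw(Va Vb);

        my $key_match = join ' and ', map { "T1.$_ = T2.$_" }   @key_cols;
        my $val_diff  = join ' or ',  map { "(T1.$_ != T2.$_)" } @val_cols;

        my $delete_sql = "delete T2 where not exists"
                       . " (select 1 from T1 where $key_match)";

        my $update_sql = "update T2 set "
                       . join(', ', map { "T2.$_ = T1.$_" } @val_cols)
                       . " from T1, T2 where $key_match and ($val_diff)";

        my $col_list   = join ', ', @key_cols, @val_cols;
        my $insert_sql = "insert T2 ($col_list) select $col_list from T1"
                       . " where not exists (select 1 from T2 where $key_match)";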

    If you want to print a report, things get more complicated and a lot slower. What my program does is open two result sets from T1 and T2, ordered by the primary key columns, then run a basic sort/merge algorithm to discover the key and value differences. By basic sort/merge, I mean an algorithm like this:

    1. Pull one row from each result set.
    LOOP:
    2. Compare T1 keys to T2 keys
    3. If T1 is "larger"
           Generate a "delete" for T2 keys
           "Advance" T2, return to loop
    4. If T1 is "smaller"
           Generate an "insert" for T1 keys and values
           "Advance" T1, return to loop
    5. Compare T1 and T2 values
    6. If different, generate an "update" statement
    7. "Advance" both T1 and T2, return to loop
    An optimization that's not immediately obvious is that once you know the table structures, the comparisons you make in steps 2 and 5 above will always be the same (the column names, and whether you use cmp or <=>, that is). Thus, you can build the comparison as an anonymous sub once, before the loop, and call it repeatedly.
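    A rough DBI sketch of that loop, for illustration. It's hypothetical: it assumes two statement handles already executed with "order by" on the key columns, @key_cols/@val_cols and a %types hash (classifying each column as 'numeric' or not) from the catalog, and generate_insert/generate_delete/generate_update helpers that emit the statements or report lines:

        # Build the key comparator once: the columns and the cmp-vs-<=>
        # choice per column never change during a run.
        my $cmp_keys = sub {
            my ($x, $y) = @_;
            for my $col (@key_cols) {
                my $r = $types{$col} eq 'numeric'
                      ? $x->{$col} <=> $y->{$col}
                      : $x->{$col} cmp $y->{$col};
                return $r if $r;
            }
            return 0;
        };

        my $r1 = $sth1->fetchrow_hashref;   # one row from each result set
        my $r2 = $sth2->fetchrow_hashref;
        while ($r1 or $r2) {
            my $c = !$r1 ?  1               # T1 exhausted: delete the rest of T2
                  : !$r2 ? -1               # T2 exhausted: insert the rest of T1
                  : $cmp_keys->($r1, $r2);
            if ($c > 0) {                   # T1 key "larger": T2 row is extra
                generate_delete($r2);
                $r2 = $sth2->fetchrow_hashref;
            }
            elsif ($c < 0) {                # T1 key "smaller": T2 row is missing
                generate_insert($r1);
                $r1 = $sth1->fetchrow_hashref;
            }
            else {                          # keys match: compare value columns
                # (NULLs treated as empty strings here -- a simplification)
                generate_update($r1)
                    if grep { ($r1->{$_} // '') ne ($r2->{$_} // '') } @val_cols;
                $r1 = $sth1->fetchrow_hashref;
                $r2 = $sth2->fetchrow_hashref;
            }
        }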

    The second process can get pretty slow (hours) if the tables grow large (millions of rows). For such tables, the client only runs the comparison on weekends. By comparison, the SQL-only stuff for a comparably large table runs in minutes.

    HTH

      Well, this was the most helpful of the comments I received back from everyone.

      I was able to create the script I was looking for along the same lines that were mentioned. However, because I needed to work with Oracle exports and other variations of DB links, we chose to use a working DB to handle most of the data manipulation.

      Based on that experience, I just want to mention that if you tackle this problem, make sure all the data you are comparing is populated. In Oracle, at least, a NULL value cannot be compared to anything, including another NULL value, so any row where that field is empty will silently fall out of the comparison.
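      For example, one way to generate a NULL-safe test is a sketch like the following. NVL is Oracle's null-replacement function; the '~' sentinel is an arbitrary placeholder, and you'd need a type-appropriate sentinel for each column:

          # NULL-safe inequality per value column: in Oracle, NULL != NULL
          # is not true, so replace NULLs with a sentinel on both sides.
          # The '~' sentinel is an arbitrary placeholder; @val_cols is
          # assumed to hold the value column names.
          my $val_diff = join ' or ',
              map { "(NVL(T1.$_, '~') != NVL(T2.$_, '~'))" } @val_cols;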

      Another point I want to convey: if you do use a DB approach rather than a hash/array combination (or whatever combination thereof), make sure you create indexes on the working tables. Otherwise the wait will be quite painful.
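      Something as simple as this helps (table and column names are placeholders for your working tables):

          # Index the working tables on the key columns before comparing.
          # Table, index, and column names are placeholders.
          $dbh->do('CREATE INDEX work_t1_keys ON work_t1 (Ka, Kb)');
          $dbh->do('CREATE INDEX work_t2_keys ON work_t2 (Ka, Kb)');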

      CHEERS

      pdt1630

Re: Full DB Table Compare
by perrin (Chancellor) on Sep 21, 2001 at 18:47 UTC
    If you use Tie::DBI, you could turn it into a comparison of two hashes.
      Does Tie::DBI read the entire recordset into memory? If so, this might not be a good way to do it for anything other than small tables. I remember once in a COBOL class we had to do a comparison like this; what we ended up doing was sorting the two data structures, then walking through and reconciling the two sets of records... that may be a better way to do it if Tie::DBI swallows the whole thing into core.
        No, I don't think it loads the whole thing into RAM. It's not all that fast either, but it might simplify the process.
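        Something like this, assuming a single-column key named id (the DSNs, table, and key names are just placeholders):

            use Tie::DBI;

            # Tie each copy of the table to a hash keyed on the primary
            # key.  DSNs, table, and key names below are placeholders.
            tie my %t1, 'Tie::DBI',
                { db => 'dbi:mysql:db1', table => 'T1', key => 'id' };
            tie my %t2, 'Tie::DBI',
                { db => 'dbi:mysql:db2', table => 'T1', key => 'id' };

            # Keys present in one copy but missing from the other.
            my @missing_from_t2 = grep { !exists $t2{$_} } keys %t1;
            my @missing_from_t1 = grep { !exists $t1{$_} } keys %t2;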