I worked on something like that for a client. The whole system is a publish-and-subscribe paradigm for getting data from other applications around the enterprise, so it needs to ensure data integrity. I didn't build it with DBI, which wasn't available at the client (non-technical reasons made it difficult to get it installed...) but the principles are the same.
The program reads the database catalogs to get the structure of tables (column names, order, and data types) and to find one or more unique indices on the tables. If the tables you're comparing have identical structures, you can actually generate SQL statements that will make the data in the tables completely the same. (If there are some structure differences -- extra columns, for instance -- it gets a little trickier, but not too bad: it's easier if you have something to tell you "ignore these columns". In my case, they're stored in a controlling table.)
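For illustration only (the real system read the vendor's catalog tables directly, since DBI wasn't available): with DBI you could pull roughly the same metadata like this. The DSN, credentials, and table name are placeholders, and not every driver implements column_info, so you may still end up querying the system tables yourself.

use DBI;

# Hypothetical connection; swap in your own DSN and credentials.
my $dbh = DBI->connect('dbi:ODBC:mydsn', 'user', 'password',
                       { RaiseError => 1 });

# Column names, in order, for table T1.
my $col_sth  = $dbh->column_info(undef, undef, 'T1', '%');
my @columns  = map { $_->{COLUMN_NAME} } @{ $col_sth->fetchall_arrayref({}) };

# Primary-key (unique index) columns for T1.
my @key_cols = $dbh->primary_key(undef, undef, 'T1');

# Everything else is a "value" column.
my %is_key   = map { $_ => 1 } @key_cols;
my @val_cols = grep { !$is_key{$_} } @columns;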
You also have to decide if you want to create a log or report of the data differences, or if you want the changes resolved "invisibly". In my case, the differences had to be displayed so they could be investigated.
To make two tables with identical structures have completely the same data, three SQL statements will suffice. Say you have two tables, T1 (the source) and T2 (the copy you want to bring into line with T1), whose unique primary key columns are Ka, Kb, ..., and whose non-key columns are Va, Vb, .... These are the statements:
-- 1. remove rows in T2 that are not in T1
delete T2
where not exists (
    select 1 from T1
    where T1.Ka = T2.Ka and T1.Kb = T2.Kb and ...
    -- The where clause can be generated by something like:
    -- join(' and ', map { "T1.$_ = T2.$_" } @key_col_names)
)
-- 2. synchronize rows that exist in both
update T2
set T2.Va = T1.Va, T2.Vb = T1.Vb, ...
from T2, T1
where T1.Ka = T2.Ka and T1.Kb = T2.Kb and ...
  and ((T1.Va != T2.Va) or (T1.Vb != T2.Vb) or ...)
-- Check that your platform supports a "from" clause
-- in update statements.
-- You can actually omit the "value" test; it just eliminates
-- unnecessary data movement. Beware that nullable columns
-- need more complicated tests.
-- 3. insert new rows
insert T2 (Ka, Kb, ..., Va, Vb, ...)
select Ka, Kb, ..., Va, Vb, ...
from T1
where not exists (
    select 1 from T2
    where T1.Ka = T2.Ka and T1.Kb = T2.Kb and ...
)
You can see how you might generate the statements in Perl if you had arrays of column names (combinations of join and map as in the example above).
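As a rough sketch, assuming the @key_cols and @val_cols arrays from the catalog step and hard-coded table names:

my $key_match = join ' and ', map { "T1.$_ = T2.$_" } @key_cols;
my $set_list  = join ', ',    map { "T2.$_ = T1.$_" } @val_cols;
my $col_list  = join ', ',    @key_cols, @val_cols;

# Simple "values differ" test; for nullable columns you'd generate the
# longer null-safe form (x != y, or exactly one of the two is null) instead.
my $val_diff  = join ' or ',  map { "(T1.$_ != T2.$_)" } @val_cols;

my $delete_sql = "delete T2 where not exists "
               . "(select 1 from T1 where $key_match)";

my $update_sql = "update T2 set $set_list from T2, T1 "
               . "where $key_match and ($val_diff)";

my $insert_sql = "insert T2 ($col_list) "
               . "select $col_list from T1 "
               . "where not exists (select 1 from T2 where $key_match)";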
If you want to print a report, things get more complicated and a lot slower. What my program does is open two result sets from T1 and T2, ordered by the primary key columns, and then run a basic sort/merge algorithm to discover the key and value differences. By basic sort/merge, I mean an algorithm like this (a rough Perl sketch follows the outline):
1. Pull one row from each result set.
LOOP:
2. Compare T1 keys to T2 keys
3. If T1 is "larger"
       Generate a "delete" for T2 keys
       "Advance" T2, return to loop
4. If T1 is "smaller"
       Generate an "insert" for T1 keys and values
       "Advance" T1, return to loop
5. Compare T1 and T2 values
6. If different, generate an "update" statement
7. "Advance" both T1 and T2, return to loop
An optimization that's not immediately obvious is that once you know the table structures, the comparisons you make in steps 2 and 5 above will always be the same (the column names and whether you use cmp or <=>, that is). Thus, you can build an anonymous sub once, before the loop, and call it repeatedly.
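For example, assuming a %numeric hash (built from the catalog's type information) that flags which key columns are numeric:

# Build the key comparator once, from the table metadata, instead of
# re-deciding cmp vs <=> on every row.
my $compare_keys = sub {
    my ($r1, $r2) = @_;
    for my $k (@key_cols) {
        my $c = $numeric{$k} ? $r1->{$k} <=> $r2->{$k}
                             : $r1->{$k} cmp $r2->{$k};
        return $c if $c;
    }
    return 0;
};

# Inside the merge loop above you'd then just call:
# my $cmp = $compare_keys->($row1, $row2);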
The second process can get pretty slow (hours) if the tables grow large (millions of rows). For such tables, the client only runs the comparison on weekends. By contrast, the SQL-only approach runs in minutes on a comparably large table.
HTH