dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I've inherited two postgres databases, a master and a backup, on different machines on the same subnet. I need to verify that the backup is complete. There are 60 tables, all under 50,000 rows except for 5 tables with 100M+ rows.

It takes 2 hours to dump a backup and a week to load one, so just rebuilding the backup from a fresh dump isn't an option (yet).

I'm thinking of using Perl to do brute-force comparisons of the small tables. I can generate SELECTs with named columns from the schema docs, fill a hash keyed on each table's key with the master data, and compare that against the backup's contents. I may have to use a cursor, and it may be slow, but it's all I can think of. On the big tables, though, I think this would be insane.
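Concretely, for one small table I mean something like this (a rough sketch only; the connection details, table name small_tbl, columns and key column id are all made up):

    use strict;
    use warnings;
    use DBI;

    my $master = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                              'user', 'pass', { RaiseError => 1 });
    my $backup = DBI->connect('dbi:Pg:dbname=mydb;host=backup-host',
                              'user', 'pass', { RaiseError => 1 });

    # fine for small tables; the big ones would need a cursor instead
    my $sql = 'SELECT id, col1, col2 FROM small_tbl';

    # fill a hash keyed on the table key with the master data
    my %want;
    for my $row (@{ $master->selectall_arrayref($sql) }) {
        my ($id, @rest) = @$row;
        $want{$id} = join "\x00", map { defined $_ ? $_ : '' } @rest;
    }

    # walk the backup and compare against the hash
    for my $row (@{ $backup->selectall_arrayref($sql) }) {
        my ($id, @rest) = @$row;
        my $got = join "\x00", map { defined $_ ? $_ : '' } @rest;
        if    (!exists $want{$id}) { print "extra row on backup: id $id\n" }
        elsif ($want{$id} ne $got) { print "row differs: id $id\n" }
        delete $want{$id};
    }
    print "missing on backup: id $_\n" for sort keys %want;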

Has anyone an idea of a better method of attack?


Re: Verifying 2 databases are in sync with brute force
by moritz (Cardinal) on May 31, 2011 at 15:35 UTC
    It takes 2 hours to dump a backup

    Iterating over the whole database in Perl will take even longer, so doing the dump on both DBs and comparing checksums of the dumps is probably faster than any pure-Perl solution.
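    For example (a minimal sketch; the hosts, user and database name are placeholders, and it assumes pg_dump output is deterministic, which is worth verifying first):

        use strict;
        use warnings;
        use Digest::MD5;

        # pipe each dump straight into a digest so nothing hits the disk
        my %dump = (
            master => 'pg_dump --clean -h master-host -U user dbname |',
            backup => 'pg_dump --clean -h backup-host -U user dbname |',
        );

        my %digest;
        for my $side (sort keys %dump) {
            open my $fh, $dump{$side} or die "can't run pg_dump for $side: $!";
            my $md5 = Digest::MD5->new;
            $md5->addfile($fh);
            close $fh or die "pg_dump for $side failed: $?";
            $digest{$side} = $md5->hexdigest;
            print "$side: $digest{$side}\n";
        }
        print $digest{master} eq $digest{backup} ? "dumps match\n" : "dumps DIFFER\n";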

    There's also pg-comparator, which might be worth looking at.

    Another possible trick is to aggregate all rows of a table into a single value using array_agg or one of its companions, and compute a checksum in postgres itself (or maybe one is already available?). That way much less data needs to be transferred out of the DB, making the process much faster.
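    Something along these lines, perhaps (a sketch only; string_agg needs 9.0+, and the table and key names "mytable"/"id" are placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                               'user', 'pass', { RaiseError => 1 });

        # collapse the text form of every row into one string server-side and
        # checksum it there; only 32 hex characters cross the wire per table
        my ($sum) = $dbh->selectrow_array(
            q{SELECT md5(string_agg(t::text, '' ORDER BY t.id)) FROM mytable t}
        );
        print "mytable: $sum\n";

    Run the same query against the backup and compare the two values.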

      I didn't even consider comparing the dumps - I just took it for granted the files would be different due to dated comments or non-deterministic output. But I'll give it a try!

      pg-comparator looks interesting. Thanks!

        I just dumped the same DB twice in a row (with pg_dump --clean -h localhost -U <user> <dbname>) and both files are identical: no timestamp information and nothing non-deterministic in this particular DB.
      I like the aggregation idea. If you had a table with an entry for every other table (perhaps even for itself) containing a checksum of some aggregate of the respective table, you could be reasonably sure that the databases were in sync. Updating that checksum table should be an ongoing process, so that you never have to checksum every table in full at comparison time.
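      Something like this could run periodically on both servers (everything here -- the table_checksums table, the table list, the key columns -- is hypothetical):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:Pg:dbname=mydb;host=localhost',
                                 'user', 'pass', { RaiseError => 1 });

          # table => its primary key column, used to fix the aggregation order
          my %key_col = ( mytable => 'id', othertable => 'code' );

          for my $t (sort keys %key_col) {
              my ($sum) = $dbh->selectrow_array(
                  qq{SELECT md5(string_agg(x::text, '' ORDER BY x.$key_col{$t}))
                     FROM $t x}
              );
              $dbh->do(q{UPDATE table_checksums SET checksum = ?, updated = now()
                         WHERE table_name = ?}, undef, $sum, $t);
          }

      Comparing the two table_checksums tables is then cheap, and any table whose checksums match can be skipped.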
Re: Verifying 2 databases are in sync with brute force
by Sewi (Friar) on May 31, 2011 at 19:42 UTC

    It depends on whether you have control over the data.

    Add a "last update" timestamp column to each table (maybe using a server-side-trigger). Next use a perl script to compare each row which has been written since the last check (time):

  • Grab the last-check timestamp from a DB table or file
  • Put the current timestamp into a variable
  • Select and verify all rows changed or added since the last check
  • Flush the timestamp gathered in step 2 back to the DB table or file

    This gets you all updated rows, as well as rows which have been deleted on one server but not on the other.
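    A rough sketch of those steps, with all table and column names (mytable, id, last_updated, sync_state) as placeholders:

        use strict;
        use warnings;
        use DBI;

        my $master = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                                  'user', 'pass', { RaiseError => 1 });
        my $backup = DBI->connect('dbi:Pg:dbname=mydb;host=backup-host',
                                  'user', 'pass', { RaiseError => 1 });

        # step 1: grab the last-check timestamp (here: a one-row state table)
        my ($last_check) = $master->selectrow_array('SELECT last_check FROM sync_state');

        # step 2: remember "now" before we start, so changes made while we
        # run are picked up on the next pass
        my ($now) = $master->selectrow_array('SELECT now()');

        # step 3: fetch rows written since the last check from both sides;
        # if last_updated is set independently on each server, list the
        # columns explicitly and leave it out of the comparison
        my $sql = 'SELECT * FROM mytable WHERE last_updated >= ? ORDER BY id';
        my %seen;
        for my $row (@{ $master->selectall_arrayref($sql, { Slice => {} }, $last_check) }) {
            $seen{ $row->{id} } = join "\x00",
                map { defined $row->{$_} ? $row->{$_} : '' } sort keys %$row;
        }
        for my $row (@{ $backup->selectall_arrayref($sql, { Slice => {} }, $last_check) }) {
            my $flat = join "\x00",
                map { defined $row->{$_} ? $row->{$_} : '' } sort keys %$row;
            if    (!exists $seen{ $row->{id} })  { print "only on backup: id $row->{id}\n" }
            elsif ($seen{ $row->{id} } ne $flat) { print "differs: id $row->{id}\n" }
            delete $seen{ $row->{id} };
        }
        print "only on master: id $_\n" for sort keys %seen;

        # step 4: flush the timestamp gathered in step 2
        $master->do('UPDATE sync_state SET last_check = ?', undef, $now);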

    You might be able to do the whole comparison as a single SELECT with a JOIN across the two servers, limiting the output to the differing rows, if the servers are connected, but that is too much for my Postgres knowledge :-)

Re: Verifying 2 databases are in sync with brute force
by Anonymous Monk on May 31, 2011 at 23:14 UTC
    I would also ask this question on the PostgreSQL community forum -- they should have some good ideas on the best way to tackle it :)
Re: Verifying 2 databases are in sync with brute force
by erix (Prior) on Jun 02, 2011 at 18:07 UTC

    Here is a comparison between two large tables in two different instances. One instance is postgres 9.0.3, the other 9.1beta (from git://git.postgresql.org/git/postgresql.git).

    The table compared (named "public.x") has 230M rows.

    This approach may not be particularly fast, but it is simple. (Make sure the ORDER BY clause fully determines the row order.) And it compares only the table data itself -- there may still be differences elsewhere, in indexes, for example.

    (Obviously, -h and -p parameters are specific to my setup.)

    --
    -- (psql): showing rowcount:
    --
    $ select to_char(count(*), repeat('G999',4)) as rowcount from public.x;
         rowcount
    -------------------
          232,765,672
    (1 row)

    #
    # (bash): dump tab-delimited file from the 9.1 instance table
    #
    $ time echo "copy (select * from public.x order by 1,2,3) to stdout delimiter E'\t'" \
        | psql -qtA -h /tmp -p 6542 -d testdb > public.x_on_testdb.txt

    real    42m47.107s

    #
    # (bash): dump tab-delimited file from the 9.0.3 instance table
    #
    $ time echo "copy (select * from public.x order by 1,2,3) to stdout delimiter E'\t'" \
        | psql -qtA -h localhost -p 55432 -d ms > public.x_on_ms.txt

    real    57m49.028s

    #
    # (bash): compare the two output files
    #
    $ time md5sum public.x_on_testdb.txt public.x_on_ms.txt
    3313cba4d2a1cb14b5ba3dcc5378c33e  public.x_on_testdb.txt
    3313cba4d2a1cb14b5ba3dcc5378c33e  public.x_on_ms.txt

    Another approach would be to use postgres' md5 function (documentation), dump the md5 of a text representation of every row, and compare those.
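    A sketch of that idea, streaming both sides through DBI and merging on a key column (the "id" column is a placeholder; note that for tables this size you would want to wrap the query in a server-side cursor, which I have left out for brevity):

        use strict;
        use warnings;
        use DBI;

        # connection details mirror my setup above; adjust for yours
        my $sql = 'SELECT id, md5(t::text) FROM public.x t ORDER BY id';
        my ($sth_a, $sth_b) = map {
            my $s = $_->prepare($sql); $s->execute; $s;
        } DBI->connect('dbi:Pg:dbname=testdb;host=/tmp;port=6542',
                       undef, undef, { RaiseError => 1 }),
          DBI->connect('dbi:Pg:dbname=ms;host=localhost;port=55432',
                       undef, undef, { RaiseError => 1 });

        # merge the two ordered streams and report differing keys
        # (assumes a numeric key; use lt/gt for string keys)
        my $ra = $sth_a->fetchrow_arrayref;
        my $rb = $sth_b->fetchrow_arrayref;
        while ($ra or $rb) {
            if (!$rb or ($ra and $ra->[0] < $rb->[0])) {
                print "only in testdb: id $ra->[0]\n";
                $ra = $sth_a->fetchrow_arrayref;
            }
            elsif (!$ra or $rb->[0] < $ra->[0]) {
                print "only in ms: id $rb->[0]\n";
                $rb = $sth_b->fetchrow_arrayref;
            }
            else {
                print "row differs: id $ra->[0]\n" if $ra->[1] ne $rb->[1];
                $ra = $sth_a->fetchrow_arrayref;
                $rb = $sth_b->fetchrow_arrayref;
            }
        }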

    HTH... and I am interested in how you eventually do the comparison; perhaps you could let us know how it goes?