dwhite20899 has asked for the wisdom of the Perl Monks concerning the following question:

I've inherited two postgres databases, a master and a backup, on different machines on the same subnet. I need to verify that the backup is complete. There are 60 tables, all under 50,000 rows except for 5 tables with 100M+ rows.

It takes 2 hours to dump a backup and a week to load one, so just rebuilding the backup from a fresh dump isn't an option (yet).

I'm thinking of using Perl to do brute-force comparisons of the small tables. I can generate SELECTs with named columns from the schema docs, fill a hash keyed on each table's key with the master data, and compare that against the backup's contents. I may have to use a cursor, and it may be slow, but it's all I can think of. On the big tables, though, I think this would be insane.
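Concretely, for one small table I mean something like this (a rough sketch only; the connection details, table name small_tbl, columns and key column id are all made up):

    use strict;
    use warnings;
    use DBI;

    my $master = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                              'user', 'pass', { RaiseError => 1 });
    my $backup = DBI->connect('dbi:Pg:dbname=mydb;host=backup-host',
                              'user', 'pass', { RaiseError => 1 });

    # fine for small tables; the big ones would need a cursor instead
    my $sql = 'SELECT id, col1, col2 FROM small_tbl';

    # fill a hash keyed on the table key with the master data
    my %want;
    for my $row (@{ $master->selectall_arrayref($sql) }) {
        my ($id, @rest) = @$row;
        $want{$id} = join "\x00", map { defined $_ ? $_ : '' } @rest;
    }

    # walk the backup and compare against the hash
    for my $row (@{ $backup->selectall_arrayref($sql) }) {
        my ($id, @rest) = @$row;
        my $got = join "\x00", map { defined $_ ? $_ : '' } @rest;
        if    (!exists $want{$id}) { print "extra row on backup: id $id\n" }
        elsif ($want{$id} ne $got) { print "row differs: id $id\n" }
        delete $want{$id};
    }
    print "missing on backup: id $_\n" for sort keys %want;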

Has anyone an idea of a better method of attack?


Re: Verifying 2 databases are in sync with brute force
by moritz (Cardinal) on May 31, 2011 at 15:35 UTC
    It takes 2 hours to dump a backup

    Iterating over the whole database in Perl will take even longer, so doing the dump on both DBs and comparing checksums of the dumps is probably faster than any pure-Perl solution.
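    For example (a minimal sketch; the hosts, user and database name are placeholders, and it assumes pg_dump output is deterministic, which is worth verifying first):

        use strict;
        use warnings;
        use Digest::MD5;

        # pipe each dump straight into a digest so nothing hits the disk
        my %dump = (
            master => 'pg_dump --clean -h master-host -U user dbname |',
            backup => 'pg_dump --clean -h backup-host -U user dbname |',
        );

        my %digest;
        for my $side (sort keys %dump) {
            open my $fh, $dump{$side} or die "can't run pg_dump for $side: $!";
            my $md5 = Digest::MD5->new;
            $md5->addfile($fh);
            close $fh or die "pg_dump for $side failed: $?";
            $digest{$side} = $md5->hexdigest;
            print "$side: $digest{$side}\n";
        }
        print $digest{master} eq $digest{backup} ? "dumps match\n" : "dumps DIFFER\n";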

    There's also pg-comparator, which might be worth looking at.

    Another possible trick is to aggregate all rows of a table into a single value using array_agg or one of its companions, and compute a checksum in postgres itself (or maybe one is already available?). That way much less data needs to be transferred out of the DB, making the process much faster.
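    Something along these lines, perhaps (a sketch only; string_agg needs 9.0+, and the table and key names "mytable"/"id" are placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                               'user', 'pass', { RaiseError => 1 });

        # collapse the text form of every row into one string server-side and
        # checksum it there; only 32 hex characters cross the wire per table
        my ($sum) = $dbh->selectrow_array(
            q{SELECT md5(string_agg(t::text, '' ORDER BY t.id)) FROM mytable t}
        );
        print "mytable: $sum\n";

    Run the same query against the backup and compare the two values.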

      I didn't even consider comparing the dumps - I just took it for granted the files would be different due to dated comments or non-deterministic output. But I'll give it a try!

      pg-comparator looks interesting. Thanks!

        I just dumped the same DB twice in a row (with pg_dump --clean -h localhost -U <user> <dbname>) and both files are identical: no timestamp information and nothing non-deterministic in this particular DB.
      I like the aggregation idea. If you had a table with an entry for every other table (perhaps even for itself) containing a checksum of some aggregate of the respective table, you could be reasonably sure that the databases were in sync. Updating that checksum table should be an ongoing process, so that you never have to checksum every table in full at comparison time.
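      Something like this could run periodically on both servers (everything here -- the table_checksums table, the table list, the key columns -- is hypothetical):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:Pg:dbname=mydb;host=localhost',
                                 'user', 'pass', { RaiseError => 1 });

          # table => its primary key column, used to fix the aggregation order
          my %key_col = ( mytable => 'id', othertable => 'code' );

          for my $t (sort keys %key_col) {
              my ($sum) = $dbh->selectrow_array(
                  qq{SELECT md5(string_agg(x::text, '' ORDER BY x.$key_col{$t}))
                     FROM $t x}
              );
              $dbh->do(q{UPDATE table_checksums SET checksum = ?, updated = now()
                         WHERE table_name = ?}, undef, $sum, $t);
          }

      Comparing the two table_checksums tables is then cheap, and any table whose checksums match can be skipped.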
Re: Verifying 2 databases are in sync with brute force
by Sewi (Friar) on May 31, 2011 at 19:42 UTC

    It depends on whether you have control over the data.

    Add a "last update" timestamp column to each table (maybe using a server-side-trigger). Next use a perl script to compare each row which has been written since the last check (time):

  • Grab the last-check timestamp from a DB table or file
  • Put the current timestamp into a variable
  • Select and verify all rows changed or added since the last check
  • Flush the timestamp gathered in step 2 back to the DB table or file

    This gets you all updated rows, as well as rows which have been deleted on one server but not on the other.
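    A rough sketch of those steps, with all table and column names (mytable, id, last_updated, sync_state) as placeholders:

        use strict;
        use warnings;
        use DBI;

        my $master = DBI->connect('dbi:Pg:dbname=mydb;host=master-host',
                                  'user', 'pass', { RaiseError => 1 });
        my $backup = DBI->connect('dbi:Pg:dbname=mydb;host=backup-host',
                                  'user', 'pass', { RaiseError => 1 });

        # step 1: grab the last-check timestamp (here: a one-row state table)
        my ($last_check) = $master->selectrow_array('SELECT last_check FROM sync_state');

        # step 2: remember "now" before we start, so changes made while we
        # run are picked up on the next pass
        my ($now) = $master->selectrow_array('SELECT now()');

        # step 3: fetch rows written since the last check from both sides;
        # if last_updated is set independently on each server, list the
        # columns explicitly and leave it out of the comparison
        my $sql = 'SELECT * FROM mytable WHERE last_updated >= ? ORDER BY id';
        my %seen;
        for my $row (@{ $master->selectall_arrayref($sql, { Slice => {} }, $last_check) }) {
            $seen{ $row->{id} } = join "\x00",
                map { defined $row->{$_} ? $row->{$_} : '' } sort keys %$row;
        }
        for my $row (@{ $backup->selectall_arrayref($sql, { Slice => {} }, $last_check) }) {
            my $flat = join "\x00",
                map { defined $row->{$_} ? $row->{$_} : '' } sort keys %$row;
            if    (!exists $seen{ $row->{id} })  { print "only on backup: id $row->{id}\n" }
            elsif ($seen{ $row->{id} } ne $flat) { print "differs: id $row->{id}\n" }
            delete $seen{ $row->{id} };
        }
        print "only on master: id $_\n" for sort keys %seen;

        # step 4: flush the timestamp gathered in step 2
        $master->do('UPDATE sync_state SET last_check = ?', undef, $now);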

    You might be able to do the whole comparison as a single SELECT with a JOIN across the two servers, limiting the output to the differing rows, if the servers are connected, but that is too much for my Postgres knowledge :-)

Re: Verifying 2 databases are in sync with brute force
by Anonymous Monk on May 31, 2011 at 23:14 UTC
    I would also ask this question on the PostgreSQL community forum -- they should have some good ideas on the best way to tackle it :)
Re: Verifying 2 databases are in sync with brute force
by erix (Prior) on Jun 02, 2011 at 18:07 UTC

    Here is a comparison between two large tables in two different instances. One instance is postgres 9.0.3, the other 9.1beta (from git://git.postgresql.org/git/postgresql.git).

    The table compared (named "public.x") has 230M rows.

    This approach may not be particularly fast, but it is simple. (Make sure the ORDER BY clause fully determines the row order.) And it compares only the table data itself -- there may still be differences elsewhere, in indexes, for example.

    (Obviously, -h and -p parameters are specific to my setup.)

    --
    -- (psql): showing rowcount:
    --
    $ select to_char(count(*), repeat('G999',4)) as rowcount from public.x;
         rowcount
    -------------------
          232,765,672
    (1 row)

    #
    # (bash): dump tab-delimited file from the 9.1 instance table
    #
    $ time echo "copy (select * from public.x order by 1,2,3) to stdout delimiter E'\t'" \
        | psql -qtA -h /tmp -p 6542 -d testdb > public.x_on_testdb.txt

    real    42m47.107s

    #
    # (bash): dump tab-delimited file from the 9.0.3 instance table
    #
    $ time echo "copy (select * from public.x order by 1,2,3) to stdout delimiter E'\t'" \
        | psql -qtA -h localhost -p 55432 -d ms > public.x_on_ms.txt

    real    57m49.028s

    #
    # (bash): compare the two output files
    #
    $ time md5sum public.x_on_testdb.txt public.x_on_ms.txt
    3313cba4d2a1cb14b5ba3dcc5378c33e  public.x_on_testdb.txt
    3313cba4d2a1cb14b5ba3dcc5378c33e  public.x_on_ms.txt

    Another approach would be to use postgres' md5 function (documentation), dump the md5 of a text representation of every row, and compare those.
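    A sketch of that idea, streaming both sides through DBI and merging on a key column (the "id" column is a placeholder; note that for tables this size you would want to wrap the query in a server-side cursor, which I have left out for brevity):

        use strict;
        use warnings;
        use DBI;

        # connection details mirror my setup above; adjust for yours
        my $sql = 'SELECT id, md5(t::text) FROM public.x t ORDER BY id';
        my ($sth_a, $sth_b) = map {
            my $s = $_->prepare($sql); $s->execute; $s;
        } DBI->connect('dbi:Pg:dbname=testdb;host=/tmp;port=6542',
                       undef, undef, { RaiseError => 1 }),
          DBI->connect('dbi:Pg:dbname=ms;host=localhost;port=55432',
                       undef, undef, { RaiseError => 1 });

        # merge the two ordered streams and report differing keys
        # (assumes a numeric key; use lt/gt for string keys)
        my $ra = $sth_a->fetchrow_arrayref;
        my $rb = $sth_b->fetchrow_arrayref;
        while ($ra or $rb) {
            if (!$rb or ($ra and $ra->[0] < $rb->[0])) {
                print "only in testdb: id $ra->[0]\n";
                $ra = $sth_a->fetchrow_arrayref;
            }
            elsif (!$ra or $rb->[0] < $ra->[0]) {
                print "only in ms: id $rb->[0]\n";
                $rb = $sth_b->fetchrow_arrayref;
            }
            else {
                print "row differs: id $ra->[0]\n" if $ra->[1] ne $rb->[1];
                $ra = $sth_a->fetchrow_arrayref;
                $rb = $sth_b->fetchrow_arrayref;
            }
        }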

    HTH... and I am interested in how you eventually do the comparison; perhaps you could let us know how it goes?