in reply to Testing database updates.

Most DBs have a dump utility that will produce a plain text file as output. These can then be diff'd in the usual ways.

Start with a known state -- eg. empty -- and use the existing script to perform a range of inserts, updates and deletes. Dump the result and call it the reference.

Start with the same known state, use the refactored script to perform the same operations and dump the result. Diff the two dumps.
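
For example, something along these lines (a rough sketch only: the sqlite3 command-line dump and the script names are assumptions; substitute whichever dump utility your DBMS provides):

#!/usr/bin/perl
use strict;
use warnings;

# Dump a database to plain text using the sqlite3 CLI (assumed here;
# mysqldump, pg_dump etc. would serve the same purpose).
sub dump_db {
    my ($dbfile, $outfile) = @_;
    system("sqlite3 $dbfile .dump > $outfile") == 0
        or die "dump of $dbfile failed: $?";
}

# Known state + existing script => reference dump.
system($^X, 'existing_script.pl', 'ref.db') == 0 or die "existing script failed";
dump_db('ref.db', 'ref.sql');

# Same known state + refactored script => check dump.
system($^X, 'refactored_script.pl', 'check.db') == 0 or die "refactored script failed";
dump_db('check.db', 'check.sql');

# Any difference is worth looking at, even if it is later dismissed.
system('diff', '-u', 'ref.sql', 'check.sql');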


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: Testing database updates.
by tobyink (Canon) on Apr 18, 2012 at 12:36 UTC

    I concur, with one proviso: you probably want to make sure that your dump script dumps records in a predictable order. Most of the dump scripts bundled with databases do not guarantee any particular order of results. You may need to write your own simple dump script.
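
    Something along these lines would do (a sketch only: DBI/DBD::SQLite and the table names are assumptions; a real script would read the table list from the schema):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '', { RaiseError => 1 } );

    for my $table (qw( widgets orders )) {    # illustrative table names
        my $sth = $dbh->prepare("SELECT * FROM $table");
        $sth->execute;
        my @cols = @{ $sth->{NAME_lc} };
        my @rows;
        while ( my @row = $sth->fetchrow_array ) {
            push @rows, join "\t", map { defined $_ ? $_ : 'NULL' } @row;
        }
        # Sort the formatted rows so the dump order never depends on the engine.
        print "-- $table (", join( ',', @cols ), ")\n";
        print "$_\n" for sort @rows;
    }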

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Though these dump utilities don't offer any particular ordering, in general, given the same versions, switches, etc., they will produce the same ordering for the same source material. I.e. they don't have a random component.

      Whilst it is possible that two DBs arrived at via different routes -- i.e. a different sequencing of inserts, updates and deletes -- may produce subtly different dumps that the OP might decide to treat as "identical", it will require a deliberate decision to consider them so.

      That is, whilst diff'ing the dumps may produce "false negatives", it will take a human decision to determine that they are indeed false.

      There is also the possibility of post-processing the dumps into some 'canonical' format -- eg. trim whitespace, sort, etc. -- but that should rarely be necessary.
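
      For example (a rough sketch of the idea, not tied to any particular dump format):

      # Trim trailing whitespace and sort the INSERT statements so that
      # row order cannot cause spurious differences.
      sub canonical_dump {
          my ($dump) = @_;
          my ( @inserts, @other );
          for my $line ( split /\n/, $dump ) {
              $line =~ s/\s+$//;
              if   ( $line =~ /^INSERT / ) { push @inserts, $line }
              else                         { push @other,   $line }
          }
          return join "\n", @other, sort(@inserts);
      }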

      For the purposes of checking refactoring, it is safe to assume that the OP would be targeting the same versions of the DBMS & utility for both the reference and check dumps; and any variations between them will be of interest, even if ultimately dismissed as irrelevant.


        Values from sequences (e.g. postgres' serial datatype) can (validly) vary and will often thwart the otherwise attractively simple dump & diff approach.

        And IMHO tobyink is right: without an index, order is not guaranteed, and I'd consider it unwise to rely on the order in the database reflecting the order of the source material.

        Just a general remark: the OP doesn't give much detail and doesn't mention what system he works with, but in case the system is PostgreSQL: it has an md5() function which can be handy for directly comparing data (and searching for differences via SQL).
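
        For instance, something like this (PostgreSQL-specific; DBI/DBD::Pg assumed, and the table and column names are made up):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Pg:dbname=testdb', '', '', { RaiseError => 1 } );

        # One md5 hash per row, keyed on the primary key; the ORDER BY keeps
        # the result stable between runs.
        my $sql = q{
            SELECT id, md5(t::text)
            FROM   some_table t
            ORDER  BY id
        };
        my %reference = map { $_->[0] => $_->[1] } @{ $dbh->selectall_arrayref($sql) };
        # ... run the refactored script, re-run the query and compare the hashes.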

Re^2: Testing database updates.
by chrestomanci (Priest) on Apr 19, 2012 at 16:51 UTC

    Thanks for your advice.

    The system I decided on was to modify the script I am re-factoring to take the path of an SQLite database on the command line. I then wrote a wrapper .t test script that creates an empty DB with just the schema before each test run. After each test run the DB is dumped to text and compared with the expected output.
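
    In outline, the wrapper does something like this (the file and script names here are placeholders, not the real ones):

    use strict;
    use warnings;
    use Test::More tests => 1;
    use File::Temp qw(tempdir);

    my $dir    = tempdir( CLEANUP => 1 );
    my $dbfile = "$dir/test.db";

    # Fresh database containing only the schema.
    system("sqlite3 $dbfile < schema.sql") == 0 or die "schema load failed";

    # Run the script under test against that database.
    system( $^X, 'script_under_test.pl', $dbfile ) == 0 or die "script failed";

    # Dump to text and compare with the stored expectation.
    my $got      = `sqlite3 $dbfile .dump`;
    my $expected = do { local ( @ARGV, $/ ) = ('expected_dump.sql'); <> };

    is( $got, $expected, 'database ends up in the expected state' );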

    My personal take is that I'd much rather start out looking at everything and explicitly choose what I can disregard -- probably through post-processing the dumps -- than prejudge the issues by preselecting what to compare.

    I am post-processing the dumps as well. I discovered almost immediately that one table contains a timestamp column that defaults to the current time, so I had to replace that timestamp with a fixed string. I am also post-processing the text output to suppress differences in newline characters and trailing whitespace.
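
    The post-processing amounts to something like this (the timestamp pattern is just an illustration):

    # Mask the defaulted timestamp column and normalise newlines and
    # trailing whitespace before comparing the dumps.
    sub clean_dump {
        my ($text) = @_;
        $text =~ s/\r\n?/\n/g;                                         # unify newlines
        $text =~ s/[ \t]+$//mg;                                        # strip trailing whitespace
        $text =~ s/\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/<TIMESTAMP>/g;  # fixed string
        return $text;
    }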

    I dare say that I will soon have to pass the XML output through a tidy program to remove non-syntactic differences in spacing and newlines, or just load both the actual and expected output with XMLin (from XML::Simple) and pass the parsed data to is_deeply().

      I dare say that I will soon have to pass the XML output

      XML? Did you choose that format? Looking at the help from my copy of sqlite3, XML isn't even an option:

      .mode MODE ?TABLE?     Set output mode where MODE is one of:
                               csv      Comma-separated values
                               column   Left-aligned columns.  (See .width)
                               html     HTML <table> code
                               insert   SQL insert statements for TABLE
                               line     One value per line
                               list     Values delimited by .separator string
                               tabs     Tab-separated values
                               tcl      TCL list elements

      Unless you chose HTML? Or did they add an XML option recently?

      Choosing XML over the .dump SQL format (IMO) is a big mistake. The SQL format is so simple to manipulate and compare:

      sqlite> .dump onegb
      PRAGMA foreign_keys=OFF;
      BEGIN TRANSACTION;
      CREATE TABLE onegb ( alpha varchar, no varchar, hex varchar, bin varchar );
      INSERT INTO "onegb" VALUES('nvbqurkn','05812224','42295f70','"Mƒp®Iÿ▒SpæÕ¦├ƒ±¦c²NB͹sÙ-¼┐ºâ5ó<Ú="');
      INSERT INTO "onegb" VALUES('xzpviekv','72456229','fd245120','"ª¿.╣¡ù©=àô²Ð╣j‗Èl╝°¾©▓º§~zo=¬fIí┬Z"');
      INSERT INTO "onegb" VALUES('qrkhwqwl','43135665','316f7a36','"æe╗´Í└╬┴╚╔¸Éwf±Û6lòêqc¸▄iZñƒ┤}E⌂¤┤"');
      ...

      (Okay. That's not the most complex schema in the world; but one advantage of sqlite for this kind of thing is that it actually uses very few different datatypes.)

      Whereas comparing XML is a nightmare. Less so, I suppose, if you don't have to worry about attribute/nested sub-element equivalence -- I assume that however you are generating the XML, it at least produces consistent XML.

      I think you are making hard work for yourself using XML, but as I said above somewhere, it is your choice. Good luck :)


        I think you misunderstood me. I am not using XML as a dump format for my database. I am using the default export format of SQL insert statements.

        The reason I need to parse XML is that the script I am re-factoring emits results in the form of XML and text files, as well as inserting into a database. I need to make sure the file output is functionally identical as well as the database output.

        In terms of comparing the XML, I think using XMLin will be sufficient, as most of the consumers of the XML files are other perl scripts (that I don't control), so if the before and after XML is identical as far as XML::Simple is concerned, then it should be acceptable to those consuming perl scripts.
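
        Something like this should be all it takes (the file names are placeholders):

        use strict;
        use warnings;
        use XML::Simple;
        use Test::More tests => 1;

        # If the two structures are identical to XML::Simple, they should be
        # equally acceptable to the other perl scripts that read the files
        # the same way.
        my $expected = XMLin('expected_output.xml');
        my $actual   = XMLin('actual_output.xml');

        is_deeply( $actual, $expected, 'refactored script produces equivalent XML' );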