FubarPA has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I'm involved in switching a client over to a new platform. Part of the change is a move to a new database platform (from a mainframe to SQL Server). I've been tasked with writing a data verification tool to make sure the data was not corrupted in the move, and I've selected Perl for the task.

Essentially, I have two CSV files: one from the mainframe, and one from SQL Server. I need to compare them record by record, field by field, and note any differences. I started down the road of using the DBI interface with CSV files, but I'm not certain this is the way to go. I have these options available to me:

I'm thinking it may be easier to keep using the CSV file as a driver, as I am now, but use ODBC access to the SQL Server to do the lookups and compares. Any thoughts would be appreciated. Thanks.
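
Update: to make this concrete, here's a rough sketch of the straight CSV-to-CSV compare I have in mind, using Text::CSV_XS rather than DBI. The file names are made up, and it assumes both exports come out in the same record order:

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1 })
        or die Text::CSV_XS->error_diag;

    # File names below are placeholders for the real exports
    open my $mf, '<', 'mainframe.csv' or die "mainframe.csv: $!";
    open my $sf, '<', 'sqlserver.csv' or die "sqlserver.csv: $!";

    my $recno = 0;
    while (my $m_row = $csv->getline($mf)) {
        $recno++;
        my $s_row = $csv->getline($sf);
        if (!defined $s_row) {
            warn "Record $recno missing from SQL Server export\n";
            last;
        }
        # Field-by-field compare; report every mismatch
        for my $i (0 .. $#$m_row) {
            no warnings 'uninitialized';
            print "Record $recno, field $i: '$m_row->[$i]' vs '$s_row->[$i]'\n"
                if $m_row->[$i] ne $s_row->[$i];
        }
    }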

Replies are listed 'Best First'.
Re: Comparing data between data sources
by pbeckingham (Parson) on Aug 17, 2004 at 14:19 UTC

    If you have CSV files already, why not just use:

    % diff old.csv new.csv
    for each of your corresponding files in the set?



    pbeckingham - typist, perishable vertebrate.
      Well, the problem is that there are some minor differences I have to take into account. For example, the mainframe exports null fields as ??? in my CSV file, so I either need to replace those with spaces first or code around that issue. My other problem is that I'm on Windows, and I don't have a diff utility at my disposal at the moment. Good thought, though; it might come in handy for some of the other validations I have to do down the line.
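
      That said, since I do have Perl, a one-liner could scrub the ??? placeholders out of the mainframe export before comparing. Something like this, assuming ??? never occurs in legitimate data (substitute spaces instead of the empty string if that's what the SQL Server export emits for nulls):

        perl -pe "s/\?\?\?//g" mainframe.csv > mainframe_clean.csv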
        diff comes with Cygwin, which can be installed in local user space, and Perl comes with that environment as well. A quick search of CPAN also yields a multitude of "diff" modules (Algorithm::Diff and Text::Diff, for example).
Re: Comparing data between data sources
by mpeppler (Vicar) on Aug 17, 2004 at 14:33 UTC
    If the data in the two sources is supposed to be exactly the same, you could compute an MD5 sum for each row and only dig deeper where the sums differ. You could even start by computing MD5 sums for the whole files and checking those: if the MD5 sums of two files are the same, the data in the files is almost certainly the same.

    If there are whitespace or other non-significant differences, you'll have to normalize them away in both CSV files before performing the MD5 checks.
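
    For example, something along these lines (the file names are invented, and it assumes the two exports produce rows in the same order):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # Return per-row MD5 sums for a file, with whitespace stripped
        # so non-significant differences don't trigger mismatches
        sub row_digests {
            my ($file) = @_;
            open my $fh, '<', $file or die "$file: $!";
            my @sums;
            while (my $line = <$fh>) {
                $line =~ s/\s+//g;
                push @sums, md5_hex($line);
            }
            return \@sums;
        }

        my $main = row_digests('mainframe.csv');
        my $sql  = row_digests('sqlserver.csv');
        warn "Row counts differ\n" if @$main != @$sql;
        for my $i (0 .. $#$main) {
            print "Row ", $i + 1, " differs - dig deeper here\n"
                if !defined $sql->[$i] or $main->[$i] ne $sql->[$i];
        }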

    Michael

Re: Comparing data between data sources
by jZed (Prior) on Aug 17, 2004 at 17:01 UTC
    I think for your purposes the Cygwin diff utility should work fine (you can use the diff binary without needing to install all of Cygwin). If you need a finer-grained comparison, you might be able to build on a CSV table diff utility.

    Regarding using ODBC to do the searching: if your files are large, you may gain some speed that way, but trying it yourself is the only way to know how much you'd gain.

Re: Comparing data between data sources
by FubarPA (Monk) on Aug 18, 2004 at 19:06 UTC
    I ended up going with a solution that uses DBI for the CSV access and Win32::ODBC to do the lookups against the SQL Server. It's not 100% efficient by any means (I'm still pretty new to Perl and all), but it gets the job done. To do the compares, I read in a CSV record, look up the corresponding record in the database, and do a field-by-field compare, which works out pretty nicely. I actually did get the CSV-to-CSV lookups working, but it was horribly slow. Thanks for the responses.
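
    In case it's useful to anyone else, the core loop looks roughly like this. I've swapped my Win32::ODBC calls for DBD::ODBC here to keep everything under DBI, and the DSN, table, and column names are all made up:

        use strict;
        use warnings;
        use DBI;

        # DSN, credentials, table and column names are placeholders
        my $csv_dbh = DBI->connect('dbi:CSV:f_dir=.;f_ext=.csv',
                                   undef, undef, { RaiseError => 1 });
        my $sql_dbh = DBI->connect('dbi:ODBC:MyDSN', 'user', 'password',
                                   { RaiseError => 1 });

        my @fields = qw(acct_no name balance);
        my $cols   = join ', ', @fields;

        my $csv_sth = $csv_dbh->prepare("SELECT $cols FROM export");
        my $sql_sth = $sql_dbh->prepare(
            "SELECT $cols FROM accounts WHERE acct_no = ?");

        $csv_sth->execute;
        while (my @csv_row = $csv_sth->fetchrow_array) {
            # Look up the corresponding SQL Server record by key
            $sql_sth->execute($csv_row[0]);
            my @sql_row = $sql_sth->fetchrow_array;
            $sql_sth->finish;   # only one row expected per key
            unless (@sql_row) {
                warn "No SQL Server row for key $csv_row[0]\n";
                next;
            }
            # Field-by-field compare
            for my $i (0 .. $#fields) {
                no warnings 'uninitialized';
                print "$csv_row[0]: $fields[$i] '$csv_row[$i]' ne '$sql_row[$i]'\n"
                    if $csv_row[$i] ne $sql_row[$i];
            }
        }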