benz has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Gurus,

I'm pretty much a Perl rookie and was hoping someone here could lend a helping hand.

I need to compare 2 CSV files and output the following:
- which lines have been deleted/added
- which lines have changed according to a specific field

Would really appreciate any help you guys can provide on this. Many thanks in advance!

Cheers,
Ben

Replies are listed 'Best First'.
Re: Comparing CSV files
by davido (Cardinal) on Oct 15, 2004 at 03:20 UTC

    There are two modules that can quickly help you manipulate CSV files. The first is Text::CSV, and the second is its faster but possibly harder to install sibling, Text::CSV_XS. There is also another module which is pretty flexible, allowing you to select any character as the delimiter (not just commas): Text::xSV.

    The rest of your question is a little trickier, and very much dependant on the nature of your CSV data.

    It may be helpful to build sets that can be compared using List::Compare. You will have to decide what constitutes "an added row", versus "a modified row". Is there some unique key in each row that you can use to watch for new keys being added as an indication of new rows? That would be the most ideal situation. But if not, you'll have to decide what combination of fields will come together to create unique keys so that you can determine whether you're lookig at a modified row or a new row.


    Dave

Re: Comparing CSV files
by jZed (Prior) on Oct 15, 2004 at 05:23 UTC
    Well, have a look at my CSV table diff utility. In regard to the other answers so far - davido has suggested the basic modules, they all three can use alternate separators, not just commas. And dragonchild - diff wouldn't do too good of a job if the files had been created as database dumps or with DBD::CSV because even where they had the same rows, they might be in different places in the file depending on what had or hadn't been deleted, inserted, updated, etc.

      DBD::CSV is a good choice, but it only supports very basic SQL statement, otherwise he can see what has been deleted just by saying:

      select foo, bar from table1 minus select foo, bar from table2

      And see what has been added by saying:

      select foo, bar from table2 minus select foo, bar from table1

      Not just minus is not supported, even sub-query is not well supported, so he has to do more coding than he should.

        Fascinating - I've never seen the minus keyword for SQL yet. Up to day, I've always used left outer joins to find out the missing parts between two sets of keys:

        -- in table2 but not in table1: SELECT r.id FROM table2 AS r LEFT JOIN table1 AS l ON r.id = l.id WHERE r.system = ? and l.id IS NULL and r.id IS NOT NULL -- in table1 but not in table2: SELECT l.id FROM table2 AS l LEFT JOIN table1 AS r ON l.id = r.id WHERE l.system = ? and l.id IS NOT NULL and r.id IS NULL SQL

        SQLite doesn't seem to document the minus keyword, so I'll stay with my (left) outer join solution, as SQLite doesn't support the full outer join. Of course, I haven't checked whether DBD::CSV supports outer joins, and I guess it doesn't jZed tells me it does, so these two SQL statements will be put to more and more use. How convenient if your whole business logic can be expressed in two SQL statements.

Re: Comparing CSV files
by dragonchild (Archbishop) on Oct 15, 2004 at 04:18 UTC
    How is what you need any different from 'diff'?

    This smells like homework ...

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      "How is what you need any different from 'diff'?"

      For example, you have a.csv:

      a,1 b,2

      and b.csv:

      b,2 a,1

      When you diff, they are differnt, but if you consider them as database tables, they are really the same. That is the difference.