Comparing CSV files

benz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Comparing CSV files by davido (Cardinal) on Oct 15, 2004 at 03:20 UTC
There are two modules that can quickly help you manipulate CSV files. The first is Text::CSV, and the second is its faster but possibly harder to install sibling, Text::CSV_XS. There is also another module which is pretty flexible, allowing you to select any character as the delimiter (not just commas): Text::xSV. The rest of your question is a little trickier, and very much dependant on the nature of your CSV data. It may be helpful to build sets that can be compared using List::Compare. You will have to decide what constitutes "an added row", versus "a modified row". Is there some unique key in each row that you can use to watch for new keys being added as an indication of new rows? That would be the most ideal situation. But if not, you'll have to decide what combination of fields will come together to create unique keys so that you can determine whether you're lookig at a modified row or a new row. Dave	[reply]
Re: Comparing CSV files by jZed (Prior) on Oct 15, 2004 at 05:23 UTC
Well, have a look at my CSV table diff utility. In regard to the other answers so far - davido has suggested the basic modules, they all three can use alternate separators, not just commas. And dragonchild - diff wouldn't do too good of a job if the files had been created as database dumps or with DBD::CSV because even where they had the same rows, they might be in different places in the file depending on what had or hadn't been deleted, inserted, updated, etc.	[reply]
Re^2: Comparing CSV files by pg (Canon) on Oct 15, 2004 at 05:36 UTC
DBD::CSV is a good choice, but it only supports very basic SQL statement, otherwise he can see what has been deleted just by saying: `select foo, bar from table1 minus select foo, bar from table2` [download] And see what has been added by saying: `select foo, bar from table2 minus select foo, bar from table1` [download] Not just minus is not supported, even sub-query is not well supported, so he has to do more coding than he should.	[reply] [d/l] [select]
Re^3: Comparing CSV files by Corion (Patriarch) on Oct 15, 2004 at 07:16 UTC
Fascinating - I've never seen the `minus` keyword for SQL yet. Up to day, I've always used `left outer join`s to find out the missing parts between two sets of keys: `-- in table2 but not in table1: SELECT r.id FROM table2 AS r LEFT JOIN table1 AS l ON r.id = l.id WHERE r.system = ? and l.id IS NULL and r.id IS NOT NULL -- in table1 but not in table2: SELECT l.id FROM table2 AS l LEFT JOIN table1 AS r ON l.id = r.id WHERE l.system = ? and l.id IS NOT NULL and r.id IS NULL SQL` [download] SQLite doesn't seem to document the `minus` keyword, so I'll stay with my (left) outer join solution, as SQLite doesn't support the full outer join. Of course, I haven't checked whether DBD::CSV supports `outer join`s, and ~~I guess it doesn't~~ jZed tells me it does, so these two SQL statements will be put to more and more use. How convenient if your whole business logic can be expressed in two SQL statements.	[reply] [d/l]
Re^4: Comparing CSV files by jZed (Prior) on Oct 15, 2004 at 14:04 UTC
Re: Comparing CSV files by dragonchild (Archbishop) on Oct 15, 2004 at 04:18 UTC
How is what you need any different from 'diff'? This smells like homework ... Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply]
Re^2: Comparing CSV files by pg (Canon) on Oct 15, 2004 at 05:29 UTC
"How is what you need any different from 'diff'?" For example, you have a.csv: `a,1 b,2` [download] and b.csv: `b,2 a,1` [download] When you diff, they are differnt, but if you consider them as database tables, they are really the same. That is the difference.	[reply] [d/l] [select]