in reply to Compare 2 csv files using a key set of colums

Whether or not you like dragonchild's alternative, it is still true that you are making a full extra copy of the file data in memory, which you really don't need. You should leave out the "@lines" arrays, and replace the "for my $line (@lines)" loops with "while (my $line = <$fh>)" loops.
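As a rough sketch of that change (not the original script): the helper name load_keys, the naive split on commas, and the in-memory demo data are all assumptions made for illustration, and the split does not handle quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: read a CSV filehandle one line at a time and index
# each record by the chosen key columns, so the whole file is never held
# in a separate @lines array.
# Assumption: fields are separated by plain commas with no quoting.
sub load_keys {
    my ($fh, @key_cols) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /,/, $line, -1;
        # Join the key columns with a NUL byte so "a,bc" and "ab,c" differ.
        my $key = join "\0", map { $_ // '' } @fields[@key_cols];
        $seen{$key} = $line;
    }
    return \%seen;
}

# Demo on an in-memory filehandle (made-up data).
my $data = "1,apple,red\n2,pear,green\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
my $records = load_keys($fh, 0);
close $fh;

print join(",", sort keys %$records), "\n";
```

Only one line is held in $line at any moment; the hash is the only structure that grows with the file.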

It might also save some memory and time to assign $hash1->{$key} = undef, because all you ever do with the hash1 and hash2 data is check for existence of keys.
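A minimal illustration of the idea, with made-up keys. The one thing to watch is that the lookup must use exists, because a plain truth test on an undef value would always fail:

```perl
use strict;
use warnings;

# Store keys only; the values are never read, so undef is enough.
my $hash1 = {};
$hash1->{$_} = undef for qw(k1 k2);

# Membership has to be tested with exists, not with the value itself.
my @probe = qw(k1 k3);
my @found = grep { exists $hash1->{$_} } @probe;
my @truthy = grep { $hash1->{$_} } @probe;

print "exists finds: @found\n";
print "truth test finds: ", scalar(@truthy), " keys\n";
```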

It would be helpful to the potential user to be more clear about what exactly the "comparison" consists of, since there are many ways of comparing two csv files, and this script only addresses one way (print any record that contains a "key" value unique to either input file, i.e. not common to both files).
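To make that one kind of comparison concrete, here is a sketch under stated assumptions: the helper names key_of, key_set and unique_records are hypothetical, the inputs are in-memory strings standing in for the two files, and the comma split ignores quoting. It reads each input twice (once for keys, once to print), which trades extra I/O for not keeping the records in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: build the key for one record from the chosen columns.
# Assumption: plain comma-separated fields, no quoted commas.
sub key_of {
    my ($line, @cols) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;
    return join "\0", map { $_ // '' } @fields[@cols];
}

# First pass: collect only the keys of one input.
sub key_set {
    my ($text, @cols) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";
    my %keys;
    while (my $line = <$fh>) {
        $keys{ key_of($line, @cols) } = undef;
    }
    close $fh;
    return \%keys;
}

# Second pass: return the records whose key is missing from the other input.
sub unique_records {
    my ($text, $other_keys, @cols) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";
    my @unique;
    while (my $line = <$fh>) {
        chomp $line;
        push @unique, $line unless exists $other_keys->{ key_of($line, @cols) };
    }
    close $fh;
    return @unique;
}

# Made-up sample data; column 0 is the key.
my $file_a = "1,apple\n2,pear\n";
my $file_b = "2,pear\n3,plum\n";
my @cols   = (0);

my @only_a = unique_records($file_a, key_set($file_b, @cols), @cols);
my @only_b = unique_records($file_b, key_set($file_a, @cols), @cols);

print "only in A: $_\n" for @only_a;
print "only in B: $_\n" for @only_b;
```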

Some users might also like to know (by means of a description in pod, for example) what the limitations are for the code as written: there's no checking for repeated "keys" within a single file, and no checking whether a "common" key has same or different data in other fields in the two files.
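Both of those checks could be added along these lines. This is only a sketch: index_records is a hypothetical helper, the data is invented, and comparing whole records assumes the two files share the same column layout, which may not hold for the original use case.

```perl
use strict;
use warnings;

# Hypothetical helper: index records by key and note any key seen twice.
# Assumption: plain comma-separated fields, no quoted commas.
sub index_records {
    my ($text, @cols) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory data: $!";
    my (%record_for, %repeated);
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /,/, $line, -1;
        my $key = join "\0", map { $_ // '' } @fields[@cols];
        $repeated{$key} = undef if exists $record_for{$key};
        $record_for{$key} = $line;    # the last record for a key wins
    }
    close $fh;
    return (\%record_for, [ sort { $a cmp $b } keys %repeated ]);
}

# Made-up sample data; column 0 is the key.
my ($rec_a, $dups_a) = index_records("1,apple\n1,apricot\n2,pear\n", 0);
my ($rec_b, $dups_b) = index_records("2,peach\n3,plum\n", 0);

# Keys present in both inputs whose full records differ.
my @changed = sort { $a cmp $b }
              grep { exists $rec_b->{$_} && $rec_a->{$_} ne $rec_b->{$_} }
              keys %$rec_a;

print "repeated in A: @$dups_a\n";
print "common key, different data: @changed\n";
```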

Re^2: Compare 2 csv files using a key set of colums
by eric256 (Parson) on Dec 19, 2005 at 16:05 UTC

    All very true. The files I use it on are quite small, so the script works fine as is. I won't change it until I get a chance/need to test it again; no point in putting broken code up! ;) But I will certainly update it.

    Regarding the keys, the script makes the assumption, or requirement, that your key defines your unique values, so two rows are identical if their keys match, regardless of the other values. This fits my current needs because I'm comparing two reports to make sure they are outputting the same information, but they have different columns, a subset of which should be the unique key. At that point I care about the extra data in the differences because it helps determine what went wrong in the reports. None of this is by way of an excuse, just an explanation of the cause for the script, and hopefully it helps explain why it does what it does. It would be rather nice to end up with a generic solution that handles all these cases, and perhaps I will work in that direction. Thanks for the feedback.


    ___________
    Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;