This lets you compare two CSV files and choose which columns should be unique. The code isn't really pretty, but it was a quick hack and I thought others might find it useful. If nothing else, now I'll be able to find it again if I lose it. ;)

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my ( $key, $file1, $file2 ) = @ARGV;
die "Usage: compare <keycols> <file1> <file2>\n"
    unless defined $key && defined $file1 && defined $file2;

open( my $fh1, "<", $file1 ) or die "Failed to open '$file1': $!";
open( my $fh2, "<", $file2 ) or die "Failed to open '$file2': $!";

my $csv      = Text::CSV->new;
my @key_cols = split /,/, $key;

my @lines1 = <$fh1>;
my @lines2 = <$fh2>;
chomp @lines1;
chomp @lines2;

my ( $hash1, $hash2, $hash_all );

for my $line (@lines1) {
    if ( $csv->parse($line) ) {
        my @field = $csv->fields;
        my $k = join '', map { '__' . uc $field[$_] } @key_cols;
        $hash1->{$k}    = [@field];    # store the fields themselves, not their count
        $hash_all->{$k} = $line;
    }
}

for my $line (@lines2) {
    if ( $csv->parse($line) ) {
        my @field = $csv->fields;
        my $k = join '', map { '__' . uc $field[$_] } @key_cols;
        $hash2->{$k}    = [@field];
        $hash_all->{$k} = $line;
    }
}

# Print any record whose key appears in only one of the two files.
for my $k ( keys %$hash_all ) {
    print "1,", $hash_all->{$k}, "\n" unless exists $hash1->{$k};
    print "2,", $hash_all->{$k}, "\n" unless exists $hash2->{$k};
}

___________
Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;

Replies are listed 'Best First'.
Re: Compare 2 csv files using a key set of columns
by dragonchild (Archbishop) on Dec 13, 2005 at 21:07 UTC
    use Text::xSV;
    use Set::Object;

    my $parser = Text::xSV->new;
    my @keys   = split ',', shift(@ARGV);

    my @sets;
    foreach my $filename (@ARGV) {
        $parser->open_file( $filename );
        push @sets, Set::Object->new;
        $parser->read_header;
        while ($parser->get_row) {
            my $key = join ',', $parser->extract( @keys );
            $sets[-1]->insert( $key );
        }
    }

    # At this point, you have all the keys in @sets. You can:
    # $union                = $sets[0] + $sets[1];
    # $intersection         = $sets[0] * $sets[1];
    # $difference           = $sets[0] - $sets[1];
    # $symmetric_difference = $sets[0] % $sets[1];

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      From the looks of it, that only inserts the key; mine stores the values so that I can generate a new CSV with the difference. This allows me to compare files with different numbers of columns and get the difference. For instance, I have two reports, one with patient name and id, the other just with name. Now I don't lose that extra data after the comparison. Minor, but in the cases where I use it, it helps a lot.


      ___________
      Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
        Yes, that's true. However, adding a per-file hash to store the key-line mapping is so trivial that I shouldn't have to mention it. Or, if memory might become an issue, you can reparse each file to find the necessary lines, which is what I would do.
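        The per-file key-to-line mapping described above really is a small addition. A minimal sketch with plain hashes and made-up data (keys already extracted, no CSV parsing) of how keeping the original lines lets you print the differing records:

```perl
use strict;
use warnings;

# Illustrative only: per-file hash of key => original line. The records
# unique to each file are simply the keys missing from the other hash.
my %file1 = ( A => 'A,alice,42', B => 'B,bob,17' );
my %file2 = ( A => 'A,alice,42', C => 'C,carol,99' );

my @only_in_1 = grep { !exists $file2{$_} } keys %file1;
my @only_in_2 = grep { !exists $file1{$_} } keys %file2;

print "1,$file1{$_}\n" for sort @only_in_1;    # prints "1,B,bob,17"
print "2,$file2{$_}\n" for sort @only_in_2;    # prints "2,C,carol,99"
```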

        I tend to write memory-efficient solutions when dealing with datafiles because I have dealt with 1G+ xSV datafiles. Just because this file is 20K doesn't mean the next file will be.


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Compare 2 csv files using a key set of columns
by graff (Chancellor) on Dec 19, 2005 at 06:55 UTC
    Whether or not you like dragonchild's alternative, it is still true that you are making a full extra copy of the file data in memory, which you really don't need. You should leave out the "@lines" arrays, and replace the  for my $line (@lines) loops with  while (my $line = <$fh>) loops.
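    That change can be sketched as follows; split /,/ stands in for Text::CSV->parse just to keep the sketch self-contained, and the sub name is made up:

```perl
use strict;
use warnings;

# Streaming version of the loop: read line by line instead of slurping
# the whole file into an array first. Only one line is in memory at a
# time, plus the hash of keys.
sub keys_from_file {
    my ( $file, @key_cols ) = @_;
    my %seen;
    open my $fh, '<', $file or die "Failed to open '$file': $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @field = split /,/, $line;    # stand-in for Text::CSV->parse
        my $k = join '', map { '__' . uc $field[$_] } @key_cols;
        $seen{$k} = undef;               # existence is all we check later
    }
    return \%seen;
}
```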

    It might also save some memory and speed to use  $hash1->{$key} = undef because all you ever do with the hash1 and hash2 data is check for existence of keys.

    It would be helpful to the potential user to be more clear about what exactly the "comparison" consists of, since there are many ways of comparing two csv files, and this script only addresses one way (print any record that contains a "key" value unique to either input file, i.e. not common to both files).

    Some users might also like to know (by means of a description in pod, for example) what the limitations are for the code as written: there's no checking for repeated "keys" within a single file, and no checking whether a "common" key has same or different data in other fields in the two files.

      All very true. The files I use it on are quite small, so the script works fine as is. I won't change it till I get a chance/need to test it again; no point in putting broken code up! ;) But I will certainly update it.

      Regarding the keys, the script makes the assumption, or requirement, that your key defines your unique values, so two rows are identical if their keys match, regardless of the other values. This fits my current needs because I'm comparing two reports to make sure they are outputting the same information; they have different columns, a subset of which should be the unique key. At that point I care about the extra data in the differences because it helps determine what went wrong in the reports. None of this is by way of an excuse, just an explanation of why the script exists and, hopefully, of why it does what it does. It would be rather nice to end up with a generic solution that handles all these cases, and perhaps I will work in that direction. Thanks for the feedback.


      ___________
      Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;