Return rows from 2 csv files that don't match

meredib has asked for the wisdom of the Perl Monks concerning the following question:

Ok monks, I have another thing that I'm trying to do. I have 2 spreadsheets; we'll call them $file1 and $file2. What I'm trying to do is compare the 2 spreadsheets and return as output the lines that are not similar. I tried to use one module called csvdiff.pm and that didn't really do what I wanted and I tried some code that used while and if statements but I couldn't get anything. All I want to do is return the rows that don't match from $file2. So I would return lines 2 and 3 of $file2 from the example below. How do I need to go about accomplishing this? Oh and I forgot to mention the reason those 2 lines matched was because of the IP address within the line.

$file1 looks like

,Connectors and Reporting Devices,,,Week of 11/16-22,,,,,,

Count,Agent Name,Device Address,Device Host Name,Device Product,Device
+ Vendor,,,1/3/2009,1/4/2009,1/5/2009
2,Syslog,10.255.33.15,name1,Switch,Cisco,,,0,0,0
2,Syslog,10.64.104.13,name2,Pix,Cisco,,,144,143,144
2,Syslog,10.64.11.52,name3,Pix,Cisco,,,2511,2043,2806

and $file2 looks like

Agent Name,Device Address,Device Host Name,Device Product,Device Vendo
+r,Count(Event ID)
Syslog,10.64.104.12,name2,PIX,CISCO,144
Syslog,10.64.104.14,name4,Unix,Unix,46
Syslog,10.64.106.13,name5,PIX,CISCO,144
[download]

Comment on Return rows from 2 csv files that don't match Download Code

Replies are listed 'Best First'.
Re: Return rows from 2 csv files that don't match by mr_mischief (Monsignor) on Feb 18, 2009 at 19:34 UTC
As is oftne the case, the data structure you use makes a world of difference. In this case, I'd recommend a hash of the IP addresses found in file1. You could also store the actual line that triggered it for diagnostics, reporting, or debugging. This first example just records that file1 had the IP present. `my %hash; while ( my ( $ip, $other_stuff ) = parse_values_file_1() ) { $hash{ $ip } = 1; } while ( my ( $ip, $other_stuff ) = parse_values_file_2() ) { print "Not found: $ip ($other_stuff)\n" unless $hash{ $ip }; }` [download] I'll leave the implementation of the parsing as an exercise for now. What you have here is a check for uniqueness. Whenever you think "uniqueness test" and "Perl", think "I can probably use a hash for that".	[reply] [d/l]
Re^2: Return rows from 2 csv files that don't match by meredib (Acolyte) on Feb 19, 2009 at 18:59 UTC
Thank you I figured it out. I didn't even think about using a hash but what you said is a good rule of thumb. #!perl use Text::CSV; $file1 = 'Master Sched.csv'; $file2 = 'Devices_01-03-2009-09-00-00.csv'; $csv = Text::CSV->new(); open IN1, "<"."$file1" or die "Can't open input file >$file1<\n"; while(<IN1>){ chomp; $csv->parse($_); @subIn = $csv->fields(); $agentName{$subIn[1]} = $_; $IP{$subIn[2]} = $_; $hostName{$subIn[3]} = $_; $product{$subIn[4]} = $_; foreach $x ( $agentName{$subIn[1]} ){ #print "$agentName{$subIn[1]}\n"; } } close IN1; open OUT, ">>"."$file1" or die "Can't open output file >$file1<\n"; print OUT "\n"; print OUT ",Differences\n"; open IN2, "<"."$file2" or die "Can't open input file >$file2<\n"; while ($in2 = <IN2>){ if($in2 =~ /Agent Name/){ next; } @subFlds = split(/,/, $in2); if($in2 =~ /\d+\.\d+\.\d+\.\d+/ \|\| $subFlds[2] ne ''){ chomp $in2; $subFlds[1] =~ s/^\s+//; $subFlds[1] =~ s/\s+$//; $w = $agentName{$subFlds[0]}; $x = $IP{$subFlds[1]}; $y = $hostName{$subFlds[2]}; $z = $product{$subFlds[3]}; #print "$in2\n"; #print length($w)."w\n"; #print length($x)."x\n"; #print length($y)."y\n"; #print length($z)."z\n"; if(length($w) > 0 && length($x) > 0 && length($y) > 0){ $csv->parse($IP{$subFlds[1]}); @subFlds = $csv->fields(); #print "$in2\n"; }else{ print OUT ",$in2\n"; } } } [download]	[reply] [d/l]
Re: Return rows from 2 csv files that don't match by Tanktalus (Canon) on Feb 18, 2009 at 23:18 UTC
If I could clean up the files to just have the lines I was interested in comparing, then using DBD::CSV would probably just be a query like "`SELECT * FROM FILE1 WHERE IP NOT IN (SELECT IP FROM FILE2)`". Or something like that. Now, granted, DBD::CSV does take a bit more to set up. But once you have it set up, there's just so much more you can do with SQL with so little code that it may be worth it. I was hoping, though, that I could see a way to get DBD::CSV to read from a filehandle rather than a file, so I could use an in-memory file as a demo, but I don't see that. Oh well.	[reply] [d/l]