comparing arrays

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: comparing arrays by ikegami (Patriarch) on Sep 21, 2004 at 15:11 UTC
`local *FILE2; open(FILE2, '<', 'file2') or die(...); my %file2 = map { /^(\S+)/; ($1 => $_) } <FILE2>; close(FILE2); local *FILE1; open(FILE1, '<', 'file1') or die(...); while (<FILE1>) { chomp; print($file2{$_}) if (exists($file2{$_})); } close(FILE1);`
Re: comparing arrays by Limbic~Region (Chancellor) on Sep 21, 2004 at 15:34 UTC
Anonymous Monk, Given that I have no idea how large your files are, reading everything into memory might not be feasible. OTOH, looping through file2 as many times as there are entries in file1 may also be too time-consuming. I have compromised by caching the offsets in file2. `#!/usr/bin/perl use strict; use warnings; my $file_1 = $ARGV[0] || 'file1.txt'; my $file_2 = $ARGV[1] || 'file2.txt'; open (FILE1, '<', $file_1) or die "Unable to open $file_1 for reading: $!"; open (FILE2, '<', $file_2) or die "Unable to open $file_2 for reading: $!"; my %offset = ( _pos => 0 ); while ( <FILE1> ) { chomp; if ( defined $offset{ $_ } ) { seek FILE2, $offset{ $_ }, 0; print scalar <FILE2>; next; } else { seek FILE2, $offset{_pos}, 0; my $pos = tell FILE2; while ( my $line = <FILE2> ) { my ($col1) = $line =~ /^(\d+)/; $offset{ $col1 } = $pos; $pos = tell FILE2; if ( $col1 eq $_ ) { print $line; $offset{_pos} = $pos; last; } } } }` This is fully functional and should be a compromise between speed and memory. Cheers - L~R Update: Added an optimization so that each line from file 2 is read a maximum of 2 times.
Re: comparing arrays by rjbs (Pilgrim) on Sep 21, 2004 at 15:31 UTC
`open(my $master_file, '<', "file1") or die "couldn't open master"; my %valid = map { chomp; ($_ => 1) } <$master_file>; close $master_file; open(my $data_file, '<', "file2") or die "couldn't open data file"; while (<$data_file>) { my ($key) = split /\s/; print if $valid{$key}; } close $data_file;` We create a hash of good values from the master file, for quick lookup. Then we iterate over the lines in the data file, printing them only if the first value is a valid key. rjbs
Re^2: comparing arrays by Limbic~Region (Chancellor) on Sep 21, 2004 at 16:46 UTC
rjbs, Though the AM didn't state it as a requirement, I wanted to point out that your solution does not preserve order. Cheers - L~R
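L~R's point about order (presumably the order of the keys in file1) can be seen on a small in-memory example. This is only an illustrative sketch: the sample data and the helper names `in_file2_order` and `in_file1_order` are assumptions, not taken from the thread. The first helper follows the hash-of-valid-keys approach and emits matches in file2 order; the second indexes file2 by its first column and walks file1, so matches come out in file1 order.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for file1 (keys) and file2 (records).
my @file1 = qw(30 10 20);
my @file2 = ("10 apple\n", "20 banana\n", "30 cherry\n", "40 date\n");

# Matches in the order they appear in file2 (hash of valid keys, scan file2).
sub in_file2_order {
    my ($keys, $lines) = @_;
    my %valid = map { $_ => 1 } @$keys;
    return grep { my ($k) = split ' ', $_; defined $k && $valid{$k} } @$lines;
}

# Matches in the order the keys appear in file1 (index file2, scan file1).
sub in_file1_order {
    my ($keys, $lines) = @_;
    my %by_key;
    for my $line (@$lines) {
        my ($k) = split ' ', $line;
        $by_key{$k} = $line if defined $k;   # a later duplicate key overwrites
    }
    return map { exists $by_key{$_} ? $by_key{$_} : () } @$keys;
}

print "file2 order:\n", in_file2_order(\@file1, \@file2);
print "file1 order:\n", in_file1_order(\@file1, \@file2);
```

The second form keeps every file2 line in memory, so it trades memory for file1 ordering.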
Re: comparing arrays by radiantmatrix (Parson) on Sep 21, 2004 at 15:51 UTC
Update: I now realize I was unintentionally redundant. I opened the reply form, then got distracted; by the time I submitted, someone else had come up with essentially the same concept. My apologies! On the upside, that does prove that it's a good idea. :) Do this, maybe: `while (<$FILE_1>) { chomp; $file1{$_} = 0; } while (<$FILE_2>) { my ($match_val) = split(/\s+/, $_); # split on whitespace print $_ if defined $file1{$match_val}; }` Searching hash keys is faster than a linear array search (especially for large data sets). The first loop loads a hash whose keys are the lines of file 1 with the trailing newline removed (the values don't matter here). The second loop prints each line in file 2 whose first column matches a hash key. Should be pretty fast, and has the added advantage of not reading all of file 2 into memory. -- $me = rand($hacker{perl}); All code, unless otherwise noted, is untested
Re^2: comparing arrays by Anonymous Monk on Sep 21, 2004 at 16:16 UTC
thanks all for many good and working suggestions
Re: comparing arrays by ambrus (Abbot) on Sep 21, 2004 at 17:26 UTC
The simple solution is to use textutils: ~~`join <(sort -n file1) <(sort -n file2)`~~ (Update: the solution above is wrong. Thanks to L~R for warning me about it. The corrected version is below, which btw finds matches only if the numbers in the first column match textually, not merely numerically.) `join <(sort -b file1) <(sort -b file2)` And here's a perl solution, dedicated to merlyn. `use warnings; use strict; use Quantum::Superpositions; my $s = do { open my $e, "<", "file1" or die 1; any(<$e>); }; { open my $m, "<", "file2" or die 2; while(<$m>) { $_=~/(\S+)/ and $1==$s and print; }; } __END__` Update 2009 Sep 2: See Re^2: Joining two files on common field for a list of other nodes where unix textutils is suggested to merge files.
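The difference between textual and numeric matching mentioned above is easy to check in Perl itself. A minimal sketch with made-up keys: `eq` compares the two strings character by character, which is how `join` treats its join field, while `==` compares their numeric values, as the Quantum::Superpositions version does.

```perl
use strict;
use warnings;

# Hypothetical keys: the same number spelled two different ways.
my ($k1, $k2) = ('007', '7');

# String comparison: the keys differ, so a textual join would not pair them.
my $textual = ($k1 eq $k2) ? 1 : 0;

# Numeric comparison: both strings evaluate to 7, so they match.
my $numeric = ($k1 == $k2) ? 1 : 0;

print "textual match: $textual\n";
print "numeric match: $numeric\n";
```

Which behaviour is wanted depends on whether keys such as `007` and `7` should count as the same record.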
Re: comparing arrays by McMahon (Chaplain) on Sep 21, 2004 at 15:27 UTC
This is my favorite answer to my favorite question: List::Compare
Re^2: comparing arrays by fergal (Chaplain) on Sep 21, 2004 at 15:38 UTC
Are you sure List::Compare applies here? The lines in the two files are not identical (file2 has extra columns after the key), so List::Compare will not consider them equal.
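fergal's objection can be shown without the module. The sketch below uses plain hashes and made-up data; the helper name `intersection` is hypothetical, and this is not List::Compare's implementation. Comparing whole lines finds nothing in common, whereas comparing only the first column of file2 against file1 does.

```perl
use strict;
use warnings;

# Hypothetical contents: file1 holds bare keys, file2 holds key plus data.
my @file1 = ('10', '30');
my @file2 = ('10 apple', '20 banana', '30 cherry');

# Elements of @$right that also occur in @$left, compared as whole strings.
sub intersection {
    my ($left, $right) = @_;
    my %seen = map { $_ => 1 } @$left;
    return grep { $seen{$_} } @$right;
}

# Whole-line comparison: '10' never equals '10 apple', so nothing matches.
my @whole = intersection(\@file1, \@file2);

# Comparing only the first column of file2 finds the shared keys.
my @keys   = map { (split ' ', $_)[0] } @file2;
my @common = intersection(\@file1, \@keys);

print "whole-line matches: ", scalar(@whole), "\n";
print "first-column matches: @common\n";
```

So a list-comparison module could still be used, but only after the key column has been extracted, and the full file2 lines would then have to be looked up separately.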
Re: comparing arrays by TedPride (Priest) on Sep 21, 2004 at 21:28 UTC
Assuming the lines are in order, as shown above... `open(INPA, $inpa) || die "Can't open $inpa"; open(INPB, $inpb) || die "Can't open $inpb"; my $a = <INPA>; chomp($a); my $b = <INPB>; while (defined $a && defined $b) { $b =~ /^(\d+) /; if ($a < $1) { $a = <INPA>; chomp($a); } elsif ($a == $1) { print $b; $a = <INPA>; chomp($a); $b = <INPB>; } else { $b = <INPB>; } } close(INPA); close(INPB);` The advantage of this code is that it's simple and easy to adapt to other formats by changing the regular expression (currently set for one or more digits followed by a space) and the comparisons (change to lt, eq, gt for string keys). It also doesn't require huge arrays or hashes.
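The string-key variant mentioned above might look like the following. This is a sketch under assumptions: both inputs are already sorted in plain string order, the data is held in arrays rather than read from files, and the helper name `merge_matches` is made up for illustration. It uses `cmp`, the three-way counterpart of `lt`, `eq` and `gt`.

```perl
use strict;
use warnings;

# Walk two sorted lists in step and return the records whose first field
# equals a key. Assumes both lists are sorted by string comparison.
sub merge_matches {
    my ($keys, $records) = @_;
    my ($i, $j) = (0, 0);
    my @out;
    while ($i < @$keys && $j < @$records) {
        my ($field) = $records->[$j] =~ /^(\S+)/;
        if (!defined $field) { $j++; next; }    # skip blank records
        my $cmp = $keys->[$i] cmp $field;
        if    ($cmp < 0) { $i++ }               # key sorts first: next key
        elsif ($cmp > 0) { $j++ }               # record sorts first: next record
        else  { push @out, $records->[$j]; $i++; $j++ }
    }
    return @out;
}

# Hypothetical sample data.
my @keys    = qw(apple cherry fig);
my @records = ("apple 1\n", "banana 2\n", "cherry 3\n", "date 4\n");
print merge_matches(\@keys, \@records);
```

As in the numeric version, a match advances both lists, so each key pairs with at most one record.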
Re: comparing arrays by graff (Chancellor) on Sep 22, 2004 at 03:01 UTC
I need to do this sort of thing (and similar related things) a lot in my work, so I wrote my own command line utility to handle it, and posted it here (cmpcol) at PM. For the case you cited, the command line would be: `cmpcol -i -l2 file1 file2` where "-i" means "output the intersection of the two files", and "-l2" means "output full lines from file2 for matches". It has lots of other bells and whistles (union or exclusive-or instead of intersection, using other columns in either file instead of the default first column, etc). HTH.