Re: comparing arrays
by ikegami (Patriarch) on Sep 21, 2004 at 15:11 UTC
local *FILE2;
open(FILE2, '<', 'file2')
or die(...);
my %file2 = map { /^(\S+)/; ($1 => $_) } <FILE2>;
close(FILE2);
local *FILE1;
open(FILE1, '<', 'file1')
or die(...);
while (<FILE1>) {
chomp;
print($file2{$_}) if (exists($file2{$_}));
}
close(FILE1);
Re: comparing arrays
by Limbic~Region (Chancellor) on Sep 21, 2004 at 15:34 UTC
Anonymous Monk,
Given that I have no idea how large your files are, reading everything into memory might not be feasible. OTOH, looping through file2 as many times as there are entries in file1 may also be too time-consuming. I have compromised by caching the offsets in file2.
#!/usr/bin/perl
use strict;
use warnings;
my $file_1 = $ARGV[0] || 'file1.txt';
my $file_2 = $ARGV[1] || 'file2.txt';
open (FILE1, '<', $file_1) or die "Unable to open $file_1 for reading: $!";
open (FILE2, '<', $file_2) or die "Unable to open $file_2 for reading: $!";
my %offset = ( _pos => 0 );
while ( <FILE1> ) {
chomp;
if ( defined $offset{ $_ } ) {
seek FILE2, $offset{ $_ }, 0;
print scalar <FILE2>;
next;
}
else {
seek FILE2, $offset{_pos}, 0;
my $pos = tell FILE2;
while ( my $line = <FILE2> ) {
my ($col1) = $line =~ /^(\d+)/;
$offset{ $col1 } = $pos;
$pos = tell FILE2;
if ( $col1 eq $_ ) {
print $line;
$offset{_pos} = $pos;
last;
}
}
}
}
This is fully functional and should be a compromise between speed and memory.
Update: Added an optimization so that each line from file2 is read a maximum of 2 times.
Re: comparing arrays
by rjbs (Pilgrim) on Sep 21, 2004 at 15:31 UTC
open(my $master_file, '<', "file1") or die "couldn't open master";
my %valid = map { chomp; ($_ => 1) } <$master_file>;
close $master_file;
open(my $data_file, '<', "file2") or die "couldn't open data file";
while (<$data_file>) {
my ($key) = split /\s/;
print if $valid{$key};
}
close $data_file;
We create a hash of good values from the master file, for quick lookup. Then we iterate over the lines in the data file, printing them only if the first field is a valid key.
|
|
rjbs,
Though the AM didn't state it as a requirement, I wanted to point out that your solution does not preserve order.
Re: comparing arrays
by radiantmatrix (Parson) on Sep 21, 2004 at 15:51 UTC
Update:
I now realize I was unintentionally redundant. I opened the reply form, then got distracted; by the time I submitted, someone else had come up with essentially the same concept. My apologies! On the upside, that does prove that it's a good idea. :)
Do this, maybe:
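my %file1;
while (<$FILE_1>) {
    chomp;              # drop the newline so keys match the split values below
    $file1{$_} = 0;
}
while (<$FILE_2>) {
    my ($match_val) = split(/\s+/, $_); # split on whitespace
    print $_ if defined $match_val && defined $file1{$match_val};
}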
Searching hash keys is faster than a linear array search (especially for large constructs). The first loop loads a hash where the keys are the data from file 1 (the values don't matter here). The second loop prints each line in file2 that has a value in its first column that matches a hash key.
Should be pretty fast, and has the added advantage of not reading all of file 2 into memory.
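For anyone who wants to try the hash-lookup idea without setting up files, here is a minimal self-contained sketch. The helper name filter_by_keys and the sample data are made up for illustration, and in-memory filehandles stand in for file1 and file2.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of $data_fh whose first
# whitespace-separated field appears as a line in $keys_fh.
sub filter_by_keys {
    my ($keys_fh, $data_fh) = @_;

    # Load the keys into a hash; only the keys matter, not the values.
    my %wanted;
    while (my $key = <$keys_fh>) {
        chomp $key;
        $wanted{$key} = 1;
    }

    # Stream the data one line at a time and keep the matching lines.
    my @hits;
    while (my $line = <$data_fh>) {
        my ($first) = split ' ', $line;
        push @hits, $line if defined $first && exists $wanted{$first};
    }
    return @hits;
}

# In-memory filehandles stand in for file1 and file2 (made-up sample data).
my $keys = "101\n103\n";
my $data = "101 alpha\n102 beta\n103 gamma\n";
open my $keys_fh, '<', \$keys or die "Can't open in-memory keys: $!";
open my $data_fh, '<', \$data or die "Can't open in-memory data: $!";
print filter_by_keys($keys_fh, $data_fh);
```

Only the keys are held in memory; the data is read line by line, so memory use should grow with the size of file1 rather than file2.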
--
$me = rand($hacker{perl});
All code, unless otherwise noted, is untested
|
|
thanks all for many good and working suggestions
Re: comparing arrays
by ambrus (Abbot) on Sep 21, 2004 at 17:26 UTC
The simple solution is to use textutils:
join <(sort -n file1) <(sort -n file2)
(Update: the solution above is wrong. Thanks to L~R for warning me about it. The corrected version is below; btw, it finds matches only if the numbers in the first column match textually, not merely numerically.)
join <(sort -b file1) <(sort -b file2)
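The difference between a textual and a numeric match mentioned in the update can be seen in Perl itself. This is only an illustrative sketch; the helper names same_text and same_number are made up, and the sample keys are invented.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two notions of "same key".
sub same_text   { my ($x, $y) = @_; return $x eq $y ? 1 : 0 }   # string comparison
sub same_number { my ($x, $y) = @_; return $x == $y ? 1 : 0 }   # numeric comparison

# Each pair is (key from one file, key from the other); made-up sample data.
for my $pair (['7', '7'], ['007', '7'], ['7.0', '7']) {
    my ($x, $y) = @$pair;
    printf "%-4s vs %-2s  text: %d  number: %d\n",
        $x, $y, same_text($x, $y), same_number($x, $y);
}
```

Keys such as 007 and 7 are equal as numbers but differ as text, so a textual join would not pair them.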
And here's a perl solution, dedicated to merlyn.
use warnings; use strict;
use Quantum::Superpositions;
my $s = do { open my $e, "<", "file1" or die 1; any(<$e>); };
{
open my $m, "<", "file2" or die 2;
while(<$m>) { $_=~/(\S+)/ and $1==$s and print; };
}
__END__
Update 2009 sep 2.
See Re^2: Joining two files on common field for a list of other nodes where unix textutils is suggested to merge files.
Re: comparing arrays
by McMahon (Chaplain) on Sep 21, 2004 at 15:27 UTC
This is my favorite answer to my favorite question:
List::Compare
|
|
Are you sure List::Compare applies here? The lines are not identical, so List::Compare will not consider them identical.
Re: comparing arrays
by TedPride (Priest) on Sep 21, 2004 at 21:28 UTC
Assuming the lines are in order, as shown above...
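open(INPA, '<', $inpa) || die "Can't open $inpa: $!";
open(INPB, '<', $inpb) || die "Can't open $inpb: $!";
my $a = <INPA>;
my $b = <INPB>;
# Test with defined() so that a key of 0 does not end the loop early.
while (defined $a && defined $b) {
    chomp($a);
    my ($n) = $b =~ /^(\d+) /;
    if (!defined $n) {
        # Line does not start with digits and a space: skip it.
        $b = <INPB>;
    }
    elsif ($a < $n) {
        $a = <INPA>;
    }
    elsif ($a == $n) {
        print $b;
        $a = <INPA>;
        $b = <INPB>;
    }
    else {
        $b = <INPB>;
    }
}
close(INPA);
close(INPB);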
The advantage of this code is it's simple and easy to edit for other formats by changing the regular expression (currently set for one or more digits followed by a space) and comparisons (change to lt, eq, gt for string keys). It also doesn't require huge arrays or hashes.
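As a rough sketch of the string-key variant mentioned above, the same merge can be written with lt and eq. The helper name merge_match and the sample data are made up for illustration, and it assumes both inputs are already sorted in plain string order.

```perl
use strict;
use warnings;

# Hypothetical helper: merge-style matching for two inputs that are both
# sorted in string (lt/gt) order. Returns the matching lines from $data_fh.
sub merge_match {
    my ($keys_fh, $data_fh) = @_;
    my @hits;
    my $key  = <$keys_fh>;
    my $line = <$data_fh>;
    while (defined $key && defined $line) {
        chomp $key;
        my ($first) = $line =~ /^(\S+)/;    # first non-whitespace run is the key
        if (!defined $first) {              # blank data line: skip it
            $line = <$data_fh>;
        }
        elsif ($key lt $first) {            # key sorts earlier: advance the keys
            $key = <$keys_fh>;
        }
        elsif ($key eq $first) {            # match: keep the line, advance both
            push @hits, $line;
            $key  = <$keys_fh>;
            $line = <$data_fh>;
        }
        else {                              # key sorts later: advance the data
            $line = <$data_fh>;
        }
    }
    return @hits;
}

# In-memory filehandles with made-up, pre-sorted sample data.
my $keys = "apple\ncherry\n";
my $data = "apple 1\nbanana 2\ncherry 3\n";
open my $keys_fh, '<', \$keys or die "Can't open in-memory keys: $!";
open my $data_fh, '<', \$data or die "Can't open in-memory data: $!";
print merge_match($keys_fh, $data_fh);
```

Like the numeric version, this advances both inputs on a match, so a key repeated in the data file is reported only once.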
Re: comparing arrays
by graff (Chancellor) on Sep 22, 2004 at 03:01 UTC
I need to do this sort of thing (and similar related things) a lot in my work, so I wrote my own command line utility to handle it, and posted it here (cmpcol) at PM.
For the case you cited, the command line would be:
cmpcol -i -l2 file1 file2
where "-i" means "output the intersection of the two files", and "-l2" means "output full lines from file2 for matches". It has lots of other bells and whistles (union or exclusive-or instead of intersection, using other columns in either file instead of the default first column, etc). HTH.