Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, if I have a file with one column of values and a second file with three columns, I need to check that each value in column one of file 1 exists in column one of file 2. If it does, print that line of file 2.
file1
1
2
3
file 2
2 43 56
6 34 56
8 24 48
so 2 43 56 printed
I thought of reading both files into arrays and then using something like
if ($file1[0] == $file2[0]) {print @file2}

Replies are listed 'Best First'.
Re: comparing arrays
by ikegami (Patriarch) on Sep 21, 2004 at 15:11 UTC
    local *FILE2;
    open(FILE2, '<', 'file2') or die(...);
    my %file2 = map { /^(\S+)/; ($1 => $_) } <FILE2>;
    close(FILE2);

    local *FILE1;
    open(FILE1, '<', 'file1') or die(...);
    while (<FILE1>) {
        chomp;
        print($file2{$_}) if (exists($file2{$_}));
    }
    close(FILE1);
Re: comparing arrays
by Limbic~Region (Chancellor) on Sep 21, 2004 at 15:34 UTC
    Anonymous Monk,
    Given that I have no idea how large your files are, reading everything into memory might not be feasible. OTOH, looping through file2 as many times as there are entries in file1 may also be too time-consuming. I have compromised by caching the offsets of lines in file2.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file_1 = $ARGV[0] || 'file1.txt';
    my $file_2 = $ARGV[1] || 'file2.txt';

    open (FILE1, '<', $file_1) or die "Unable to open $file_1 for reading: $!";
    open (FILE2, '<', $file_2) or die "Unable to open $file_2 for reading: $!";

    my %offset = ( _pos => 0 );
    while ( <FILE1> ) {
        chomp;
        if ( defined $offset{ $_ } ) {
            seek FILE2, $offset{ $_ }, 0;
            print scalar <FILE2>;
            next;
        }
        else {
            seek FILE2, $offset{_pos}, 0;
            my $pos = tell FILE2;
            while ( my $line = <FILE2> ) {
                my ($col1) = $line =~ /^(\d+)/;
                $offset{ $col1 } = $pos;
                $pos = tell FILE2;
                if ( $col1 eq $_ ) {
                    print $line;
                    $offset{_pos} = $pos;
                    last;
                }
            }
        }
    }
    This is fully functional and should be a compromise between speed and memory.

    Cheers - L~R

    Update: Added optimization so that each line from file 2 is read a maximum of 2 times

Re: comparing arrays
by rjbs (Pilgrim) on Sep 21, 2004 at 15:31 UTC
    open(my $master_file, '<', "file1") or die "couldn't open master";
    my %valid = map { chomp; ($_ => 1) } <$master_file>;
    close $master_file;

    open(my $data_file, '<', "file2") or die "couldn't open data file";
    while (<$data_file>) {
        my ($key) = split /\s/;
        print if $valid{$key};
    }
    close $data_file;
    We create a hash of good values from the master file, for quick lookup. Then we iterate over the lines in the data file, printing them only if the first value is a valid key.
    rjbs
      rjbs,
      Though the AM didn't state it as a requirement, I wanted to point out that your solution does not preserve order.

      Cheers - L~R
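To make the ordering point concrete, here is a minimal sketch (not taken from any reply in this thread), assuming the order meant is that of file 1. It emits matches in file 1's order and keeps every file 2 line that shares a key. The sub name `lines_in_master_order` and the in-memory array arguments are assumptions for illustration only, and the approach holds all of file 2 in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of file 2 whose first column
# appears in file 1, ordered by file 1 rather than by file 2.
sub lines_in_master_order {
    my ($keys, $data_lines) = @_;    # array refs: file 1 values, file 2 lines

    # Group file 2 lines by their first whitespace-delimited field.
    my %by_key;
    for my $line (@$data_lines) {
        my ($key) = $line =~ /^\s*(\S+)/ or next;    # skip blank lines
        push @{ $by_key{$key} }, $line;
    }

    # Walk file 1 in order and emit whatever file 2 had for each value.
    my @out;
    for my $key (@$keys) {
        push @out, @{ $by_key{$key} } if exists $by_key{$key};
    }
    return @out;
}

# Made-up sample data: file 1 lists 8 before 2, so the "8" line comes first.
my @matched = lines_in_master_order(
    [ 8, 2, 3 ],
    [ "2 43 56\n", "6 34 56\n", "8 24 48\n" ],
);
print @matched;
```

With the hash built from file 1 instead (as in the reply above), the output follows file 2's order, which is the cheaper choice when file 2 is the large one.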

Re: comparing arrays
by radiantmatrix (Parson) on Sep 21, 2004 at 15:51 UTC
    Update: I now realize I was unintentionally redundant. I opened the reply form, then got distracted; by the time I submitted, someone else had come up with essentially the same concept. My apologies! On the upside, that does prove that it's a good idea. :)

    Do this, maybe:

    my %file1;
    while (<$FILE_1>) {
        chomp;    # strip the newline so the keys match column 1 of file 2
        $file1{$_} = 0;
    }
    while (<$FILE_2>) {
        my ($match_val) = split(/\s+/, $_);    # split on whitespace
        print $_ if defined $file1{$match_val};
    }
    Searching hash keys is faster than a linear array search (especially for large constructs). The first loop loads a hash where the keys are the data from file 1 (the values don't matter here). The second loop prints each line in file2 that has a value in its first column that matches a hash key.

    Should be pretty fast, and has the added advantage of not reading all of file 2 into memory.

    --
    $me = rand($hacker{perl});

    All code, unless otherwise noted, is untested
      thanks all for many good and working suggestions
Re: comparing arrays
by ambrus (Abbot) on Sep 21, 2004 at 17:26 UTC

    The simple solution is to use textutils

    join <(sort -n file1) <(sort -n file2)

    (Update: the solution above is wrong. Thanks to L~R for warning me about it. The corrected version is below, which btw finds matches only if the numbers in the first column match textually, not only numerically.)

    join <(sort -b file1) <(sort -b file2)
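The textual-versus-numeric distinction also matters for the hash-based Perl answers in this thread, because hash keys are compared as strings. Below is a small sketch, with made-up values, of how one might normalize keys when numeric matching is wanted. `normalize_key` is a hypothetical helper, and it assumes every key is a plain decimal number.

```perl
use strict;
use warnings;

# Hypothetical helper: map a numeric-looking string to a canonical form
# so that "02", "2" and "2.0" all become the same hash key.
sub normalize_key {
    my ($raw) = @_;
    return $raw + 0;    # numeric context drops leading zeros and a trailing ".0"
}

# Two lookup tables built from the same file 1 value.
my %seen_text = map { ($_ => 1) } ("2");
my %seen_num  = map { (normalize_key($_) => 1) } ("2");

my $candidate = "02";    # what column 1 of file 2 might contain
print "textual match\n" if $seen_text{$candidate};
print "numeric match\n" if $seen_num{ normalize_key($candidate) };
```

Only the second `print` fires here, since the string "02" is not the hash key "2" until both sides are normalized.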

    And here's a perl solution, dedicated to merlyn.

    use warnings;
    use strict;
    use Quantum::Superpositions;

    my $s = do {
        open my $e, "<", "file1" or die 1;
        any(<$e>);
    };
    {
        open my $m, "<", "file2" or die 2;
        while(<$m>) {
            $_=~/(\S+)/ and $1==$s and print;
        };
    }
    __END__

    Update 2009 sep 2.

    See Re^2: Joining two files on common field for a list of other nodes where unix textutils is suggested to merge files.

Re: comparing arrays
by McMahon (Chaplain) on Sep 21, 2004 at 15:27 UTC
    This is my favorite answer to my favorite question:
    List::Compare
      Are you sure List::Compare applies here? The lines are not identical, so they will not be considered identical by List::Compare.
Re: comparing arrays
by TedPride (Priest) on Sep 21, 2004 at 21:28 UTC
    Assuming the lines are in order, as shown above...
    open(INPA, $inpa) || die "Can't open $inpa";
    open(INPB, $inpb) || die "Can't open $inpb";
    my $a = <INPA>;
    chomp($a);
    my $b = <INPB>;
    while ($a && $b) {
        $b =~ /^(\d+) /;
        if ($a < $1) {
            $a = <INPA>;
            chomp($a);
        }
        elsif ($a == $1) {
            print $b;
            $a = <INPA>;
            chomp($a);
            $b = <INPB>;
        }
        else {
            $b = <INPB>;
        }
    }
    close(INPA);
    close(INPB);
    The advantage of this code is it's simple and easy to edit for other formats by changing the regular expression (currently set for one or more digits followed by a space) and comparisons (change to lt, eq, gt for string keys). It also doesn't require huge arrays or hashes.
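A sketch of the same merge idea, written as a sub over in-memory lists so the loop conditions are easy to see. It is a variation, not the code above: it assumes both inputs are sorted ascending by numeric key, it bounds the loop by index so that a key of 0 does not end it early, and on a match it advances only the file 2 side so repeated keys in file 2 are all printed. The name `merge_matches` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: both inputs must already be sorted ascending
# by numeric key. Returns the file 2 lines whose key occurs in file 1.
sub merge_matches {
    my ($keys, $data_lines) = @_;    # array refs: file 1 values, file 2 lines
    my ($i, $j) = (0, 0);
    my @out;

    while ($i < @$keys && $j < @$data_lines) {
        my ($key) = $data_lines->[$j] =~ /^(\d+)\s/;
        if (!defined $key) { $j++; next; }    # skip lines without a numeric key

        my $order = $keys->[$i] <=> $key;     # use cmp here for string keys
        if    ($order < 0) { $i++ }           # file 1 is behind: advance it
        elsif ($order > 0) { $j++ }           # file 2 is behind: advance it
        else               { push @out, $data_lines->[$j]; $j++ }    # match
    }
    return @out;
}

# Made-up sample data, including a key of 0.
my @merged = merge_matches(
    [ 0, 2, 3 ],
    [ "0 1 1\n", "2 43 56\n", "6 34 56\n" ],
);
print @merged;
```

For string keys, the regular expression would need to capture `\S+` instead of `\d+`, and both files would have to be sorted as strings for `cmp` to walk them correctly.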
Re: comparing arrays
by graff (Chancellor) on Sep 22, 2004 at 03:01 UTC
    I need to do this sort of thing (and similar related things) a lot in my work, so I wrote my own command line utility to handle it, and posted it here (cmpcol) at PM.

    For the case you cited, the command line would be:

    cmpcol -i -l2 file1 file2
    where "-i" means "output the intersection of the two files", and "-l2" means "output full lines from file2 for matches". It has lots of other bells and whistles (union or exclusive-or instead of intersection, using other columns in either file instead of the default first column, etc). HTH.