Re^3: Getting data from second file, based on first files contents;

The following achieves what you want with just one pass over file1.txt and two passes over file2.txt.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my ($ref_file, $data_file) = qw{pm_1146340_file1.txt pm_1146340_file2.txt};
my (%ref_left, %ref_right, @output);

open my $ref_fh, '<', $ref_file;
while (<$ref_fh>) {
    chomp;
    undef $ref_left{$_};
}
close $ref_fh;

open my $data_fh, '<', $data_file;
while (<$data_fh>) {
    my ($left, $right) = split ' ', $_, 2;
    next unless exists $ref_left{$left} and not defined $ref_left{$left};
    ++$ref_left{$left};
    ++$ref_right{$right};
}
seek $data_fh, 0, 0;
while (<$data_fh>) {
    my ($left, $right) = split ' ', $_, 2;
    next unless $ref_right{$right}; 
    push @output, $_;
}   
close $data_fh;

print for @output;

Output:

123 string 1
111 string 1
222 string 1
333 string 1
456 string 2
444 string 2
555 string 2
666 string 2
789 string 3
777 string 3
888 string 3
999 string 3

If the data in file2.txt is always ordered as shown, i.e. references to file1.txt data always appear first, such as

123 string 1
111 string 1

and never as

111 string 1
123 string 1

you'll only need one pass over file2.txt.
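A minimal sketch of that single-pass variant, assuming the ordering guarantee above really holds. The subroutine name `filter_single_pass` and the in-memory sample data are hypothetical and stand in for reading the two files:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not part of the script above): one pass over the
# data lines, assuming each group's file1.txt reference line comes first.
sub filter_single_pass {
    my ($ref_keys, $data_lines) = @_;
    my %ref_left = map { $_ => 1 } @$ref_keys;
    my (%ref_right, @output);

    for my $line (@$data_lines) {
        my ($left, $right) = split ' ', $line, 2;
        next unless defined $right;    # skip blank or one-field lines
        # A left value listed in file1.txt marks its string as wanted.
        ++$ref_right{$right} if $ref_left{$left};
        # Keep the line only if its string has already been marked.
        push @output, $line if $ref_right{$right};
    }
    return @output;
}

# Sample data for illustration only.
my @ref  = qw{123 456};
my @data = (
    "123 string 1\n", "111 string 1\n",
    "999 string 9\n",
    "456 string 2\n", "444 string 2\n",
);
print filter_single_pass(\@ref, \@data);
```

If a line such as "111 string 1" turns up before "123 string 1", this version silently drops it, which is why the two-pass script is the safer default.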

To more fully test your code, I'd completely jumble up file2.txt and then add additional records, such as

123 string 4
111 string 4

The output should be the same with no instances of "string 4" appearing at all.
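One way to build such a test input is sketched below, using `shuffle` from the core List::Util module. The helper name `jumble_records` and the sample records are hypothetical; in practice you would read the lines from file2.txt and write the result to a new file:

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use List::Util qw{shuffle};

# Hypothetical helper: shuffle the existing records, then append the extras.
sub jumble_records {
    my ($records, $extras) = @_;
    return (shuffle(@$records), @$extras);
}

# Sample data for illustration only.
my @records = map { "$_\n" } (
    '123 string 1', '111 string 1',
    '456 string 2', '444 string 2',
);
my @extras = ("123 string 4\n", "111 string 4\n");

print jumble_records(\@records, \@extras);
```

Because the record order changes on every run, compare the sorted output of the jumbled run against the sorted output of the original run. The extras are appended last here because the script above acts only on the first file2.txt line for each file1.txt value, so where they land relative to the existing records can change the result.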

Update: I took my own advice (re "To more fully test your code, ...") and found a problem. I have fixed this by making changes to the first and second while loops. The original code is in the spoiler below.

while (<$ref_fh>) {
    chomp;
    ++$ref_left{$_};
}
...
while (<$data_fh>) {
    my ($left, $right) = split ' ', $_, 2;
    next unless $ref_left{$left} or $ref_right{$right};
    ++$ref_right{$right};
}

— Ken

Re^4: Getting data from second file, based on first files contents; by james28909 (Deacon) on Oct 30, 2015 at 04:10 UTC
Yeah, it SHOULD match 'string 4' (on all occurrences in file2) IF it contains any lines from file1.txt. If you put '123 string 4' inside file2, then it should take '123' from file1 and match the same '123' in file2. Then you take the value directly to the right of the match (in file2) and compare it against the whole of file2. If the value to the right of '123' is 'string 4', then it most definitely needs to match if 123 is in file1, which it obviously is. In essence you are filtering file2.txt, which you could think of as a database or something. Anyway, thanks for replying :)
Re^5: Getting data from second file, based on first files contents; by kcott (Archbishop) on Oct 30, 2015 at 07:43 UTC
I didn't go into details in my Update:, but what I considered to be the problem was that it DID match 'string 4'. I posted the original code with my update. That's probably what you want. — Ken