james28909 has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I asked this earlier in the CB, but I think it was too hard to explain there, so I'm posting this nice question for your viewing pleasure. :) The main reason I post this question is not for sheer enjoyment, but to find other ways to do the same thing.

Anyway, let me start off by posting example code:

And the code:
use warnings;
use strict;

my $f1 = 'file1.txt';
my $f2 = 'file2.txt';
my @original;
my @gottem;

open( my $file1, '<', $f1 );
while ( my $line1 = (<$file1>) ) {
    next if ( $line1 =~ /^#/ );
    chomp($line1);
    open( my $file2, '<', $f2 );
    while ( my $line2 = (<$file2>) ) {
        next if ( $line2 =~ /^#/ );
        chomp($line2);
        my ( $leftside, $rightside ) = split( /\s/, $line2, 2 );
        if ( $line1 =~ $leftside ) {
            push( @original, $rightside );
        }
    }
}

# print "$_\n" for(@original);

for (@original) {
    open( my $file_2, '<', $f2 );
    while ( my $line_2 = (<$file_2>) ) {
        my ( $first, $last ) = split( /\s/, $line_2, 2 );
        if ( $last =~ $_ ) {
            push( @gottem, $line_2 );
        }
    }
}

print $_ for (@gottem);
here is some sample data:

file1.txt
123
456
789
file2.txt
123 string 1
111 string 1
script should skip this line
222 string 1
333 string 1
456 string 2
444 string 2
it should skip this line as well
555 string 2
666 string 2
789 string 3
777 string 3
also skipping this line too
888 string 3
999 string 3
Output:
123 string 1
111 string 1
222 string 1
333 string 1
456 string 2
444 string 2
555 string 2
666 string 2
789 string 3
777 string 3
888 string 3
999 string 3
Thanks for the tips btw :)

What I am trying to do is take each string from file1.txt and match it in file2.txt, which populates @original. Once that is done, I loop through @original and compare it back against file2.txt, but this time I am comparing the right side of the split to the current line in file2.txt. Then there is a final push to @gottem, just in case I need to use this list somewhere else.

This will allow me to catch all lines in file2.txt that match 'string 1' or 'string 2', because those are what I am going for, but there is no way to get the whole set of lines out of file2.txt without some kind of moderate algorithm.

Like I said, I really would love to see some other ways to do this.

If you have any questions or need me to clarify/explain it more, just let me know. EDIT: It seems there is indeed an error in the output; there are some additional 'string 2's in there after 'string 5'. I am not sure how they got in there, but hopefully you still see my idea and can expand the horizon :D

I will tinker with it a little more and try to get the correct output, sorry for any confusion.

EDIT: Updated the post with the sample data and expected output.

Replies are listed 'Best First'.
Re: Getting data from second file, based on first file's contents
by kcott (Archbishop) on Oct 29, 2015 at 05:28 UTC

    G'day james28909,

    "Anyway, let me start off by posting example code and files:"

    For future reference, please post a short, representative sample of your data here. I tried to download the zip file you linked to, but

    $ wget https://dl.dropboxusercontent.com/u/64707444/monks/monks.zip
    --2015-10-29 15:29:58--  https://dl.dropboxusercontent.com/u/64707444/monks/monks.zip
    Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 199.47.217.101
    Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|199.47.217.101|:443... connected.
    ERROR: The certificate of `dl.dropboxusercontent.com' is not trusted.
    ERROR: The certificate of `dl.dropboxusercontent.com' hasn't got a known issuer.

    [Perhaps I could've tried harder to get this but I don't really have the time and I shouldn't have to, anyway.]

    Here are some tips on the code you presented.

    When opening files, always check for problems. Either use the autodie pragma or hand-craft messages (see open for examples).
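    For illustration, here is a minimal sketch of both approaches (the file names are just placeholders):

    # Option 1: the autodie pragma turns a failed open into a fatal,
    # descriptive error automatically.
    use autodie;
    open my $fh, '<', 'file1.txt';    # dies with the file name and $! on failure

    # Option 2: hand-craft the message yourself (without autodie in scope).
    open my $fh2, '<', 'file2.txt'
        or die "Cannot open 'file2.txt' for reading: $!";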

    Repeatedly opening files in a loop, and reading their entire contents multiple times, is rarely (if ever) a good idea. I see that you've done this in both a while and a for loop. Aim to open and read once. If you need to jump around in an opened file, consider seek and tell.
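    For example, a rough sketch of reading one handle twice with seek and tell (the processing is hypothetical; the point is the calls themselves):

    open my $fh, '<', 'file2.txt' or die "Cannot open 'file2.txt': $!";

    my $lines = 0;
    $lines++ while <$fh>;        # first pass: count lines

    my $end_pos = tell $fh;      # byte offset after the first pass
    seek $fh, 0, 0;              # rewind instead of reopening the file

    while (my $line = <$fh>) {   # second pass over the same handle
        # ... process $line ...
    }
    close $fh;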

    When you read "file1" (for the first time), it may be better to store the data in a hash. For example, instead of

    push( @original, $rightside );

    perhaps something closer to

    ++$original{$rightside};

    You can then lose the "for (@original) {...}" loop altogether, and change

    if ( $last =~ $_ ) {

    to something like

    if ($original{$last}) {
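    Putting those two changes together, the second pass might look roughly like this (a sketch only, reusing the variable names from your code):

    my %original;    # populated in the first pass with ++$original{$rightside};

    open( my $file_2, '<', $f2 ) or die "Cannot open '$f2': $!";
    while ( my $line_2 = <$file_2> ) {
        my ( $first, $last ) = split( /\s/, $line_2, 2 );
        next unless defined $last;
        chomp($last);    # the first pass chomped, so match the stored keys
        if ( $original{$last} ) {
            push( @gottem, $line_2 );
        }
    }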

    Also, your use of a regex match ($last =~ $_) seems questionable. I haven't delved too deeply into this, but a straight equality check ($last eq $_) looks like it might be a better idea.
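    For instance (made-up values, just to show the difference), a regex match succeeds on a partial match where eq would not:

    my $last = 'string 10';
    print "regex matched\n" if $last =~ 'string 1';    # true: 'string 1' occurs inside 'string 10'
    print "eq matched\n"    if $last eq 'string 1';    # false: the strings are not identical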

    These suggestions have been intentionally vague. Without any input and only erroneous expected output (you wrote: "EDIT: It seems there is indeed an error in the output"), I am somewhat loath to suggest anything more concrete with regard to the actual processing.

    If you do provide sample input and real expected output, I (or another monk) might provide a better answer.

    — Ken

      here is some sample data:

      file1.txt
      123
      456
      789
      file2.txt
      123 string 1
      111 string 1
      script should skip this line
      222 string 1
      333 string 1
      456 string 2
      444 string 2
      it should skip this line as well
      555 string 2
      666 string 2
      789 string 3
      777 string 3
      also skipping this line too
      888 string 3
      999 string 3
      and the stuff that gets extracted from file 2 is based on file1's contents. It takes the data from file 1, gets the first match it finds in file 2, then takes the right-side column and compares that against the entire file.

      Output:
      123 string 1
      111 string 1
      222 string 1
      333 string 1
      456 string 2
      444 string 2
      555 string 2
      666 string 2
      789 string 3
      777 string 3
      888 string 3
      999 string 3
      Thanks for the tips btw :)
      Will update OP

        The following achieves what you want with just one pass over file1.txt and two passes over file2.txt.

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        my ($ref_file, $data_file) = qw{pm_1146340_file1.txt pm_1146340_file2.txt};
        my (%ref_left, %ref_right, @output);

        open my $ref_fh, '<', $ref_file;
        while (<$ref_fh>) {
            chomp;
            undef $ref_left{$_};
        }
        close $ref_fh;

        open my $data_fh, '<', $data_file;
        while (<$data_fh>) {
            my ($left, $right) = split ' ', $_, 2;
            next unless exists $ref_left{$left} and not defined $ref_left{$left};
            ++$ref_left{$left};
            ++$ref_right{$right};
        }

        seek $data_fh, 0, 0;
        while (<$data_fh>) {
            my ($left, $right) = split ' ', $_, 2;
            next unless $ref_right{$right};
            push @output, $_;
        }
        close $data_fh;

        print for @output;

        Output:

        123 string 1
        111 string 1
        222 string 1
        333 string 1
        456 string 2
        444 string 2
        555 string 2
        666 string 2
        789 string 3
        777 string 3
        888 string 3
        999 string 3

        If the data in file2.txt is always ordered as shown, i.e. references to file1.txt data always appear first, such as

        123 string 1
        111 string 1

        and never as

        111 string 1
        123 string 1

        you'll only need one pass over file2.txt.
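        Under that assumption, a single-pass version might look roughly like this (an untested sketch, using the same placeholder file names as above):

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use autodie;

        my ($ref_file, $data_file) = qw{pm_1146340_file1.txt pm_1146340_file2.txt};
        my (%ref_left, %ref_right);

        open my $ref_fh, '<', $ref_file;
        while (<$ref_fh>) {
            chomp;
            undef $ref_left{$_};
        }
        close $ref_fh;

        # Relies on the ordering assumption: a line whose left column is a
        # file1.txt key appears before any other line sharing its right column.
        open my $data_fh, '<', $data_file;
        while (<$data_fh>) {
            my ($left, $right) = split ' ', $_, 2;
            next unless defined $right;
            ++$ref_right{$right} if exists $ref_left{$left};
            print if $ref_right{$right};
        }
        close $data_fh;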

        To more fully test your code, I'd completely jumble up file2.txt and then add additional records, such as

        123 string 4
        111 string 4

        The output should be the same with no instances of "string 4" appearing at all.

        Update: I took my own advice (re "To more fully test your code, ...") and found a problem. I have fixed this by making changes to the first and second while loops. The original code is in the spoiler below.

        — Ken