chemshifts has asked for the wisdom of the Perl Monks concerning the following question:

Hello. This question was posted and answered on StackOverflow. I have two files that I would like to match based on the first letter of the second column in File1 and the first letter of the third column in File2. For example:

File1:
1 H 35
1 C 22
2 H 20
2 C 30

File2:
A 1 HB2 MET 1
A 2 CA MET 1
A 3 HA ASP 2
A 4 CA ASP 2

Output:
1 MET HB2 35
1 MET CA 22
2 ASP HA 20
2 ASP CA 30

Below is my script:

#!/usr/bin/perl
use strict;
use warnings;

my %data;

open (SHIFTS,"file1.txt") or die;
open (PDB, "file2.txt") or die;

while (my $line = <PDB>) {
    chomp $line;
    my @fields = split(/\t/,$line);
    $data{$fields[4]} = $fields[2];
}
close PDB;

while (my $line = <SHIFTS>) {
    chomp($line);
    my @columns = split(/\t/,$line);
    my $value = ($columns[1] =~ m/^.*?([A-Za-z])/ );
}
if (my $value = $data{"$_"})
print "$columns[0]\t$fields[3]\t$value\t$data{$value}\n";

close SHIFTS;
exit;

I am not sure how to implement the match; my if statement above is not correct. Any advice would be greatly appreciated.

Replies are listed 'Best First'.
Re: Match two files using regex
by stevieb (Canon) on Jun 02, 2015 at 18:01 UTC

    Even though I answered this on StackOverflow, I'll paste my solution here for completeness:

    Here's one way using split() hackery:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $f1 = 'file1.txt';
    my $f2 = 'file2.txt';

    my @pdb;

    open my $pdb_file, '<', $f2
      or die "Can't open the PDB file $f2: $!";

    while (my $line = <$pdb_file>){
        chomp $line;
        push @pdb, $line;
    }
    close $pdb_file;

    open my $shifts_file, '<', $f1
      or die "Can't open the SHIFTS file $f1: $!";

    while (my $line = <$shifts_file>){
        chomp $line;
        my $pdb_line = shift @pdb;

        # - inner split: get the third element from the $pdb_line
        # - outer split: get the first element (character) from the
        #   result of the inner split

        my $criteria = (split('', (split('\s+', $pdb_line))[2]))[0];

        # - compare the 2nd element of the file1.txt line against
        #   the above split() operations

        if ((split('\s+', $line))[1] eq $criteria){
            print "$pdb_line\n";
        }
        else {
            print "**** >$pdb_line< doesn't match >$line<\n";
        }
    }

    Files:

    file1.txt (note I changed line two to ensure a non-match worked):

    1 H 35
    1 A 22
    1 H 20

    file2.txt:

    A 1 HB2 MET 1
    A 2 CA MET 1
    A 3 HA MET 1

    Output:

    ./app.pl
    A 1 HB2 MET 1
    ****>A 2 CA MET 1< doesn't match >1 A 22<
    A 3 HA MET 1

    -stevieb

      It was correct, and I appreciate you posting it, as I got the output I needed; however, I would just like to see where I went wrong with my code.

        Gotcha. Going forward, it's always best to state up front that you've cross-posted and that you're just looking for further advice.

        No harm done. I'll have another look at your original code later on if nobody else gets a chance to debug it.

        Cheers,

        -stevieb

      It was correct, and I appreciate you posting it, as I got the output I needed; however, I would just like to see where I went wrong with my code.

      Actually, it doesn't look correct. The output lines are supposed to begin with a number, not a letter, and the output is missing the values from the third column of the first file, which should be the fourth column of the output.

        I was able to get the correct output by adding these variables to the print command.
Re: Match two files using regex
by stevieb (Canon) on Jun 02, 2015 at 17:49 UTC
Re: Match two files using regex
by Anonymous Monk on Jun 02, 2015 at 18:41 UTC

    It's just a simple one-liner

    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1128822

    use strict;
    use warnings;

    $_ = <<END; # input
    1 H 35
    1 C 22
    2 H 20
    2 C 30

    A 1 HB2 MET 1
    A 2 CA  MET 1
    A 3 HA  ASP 2
    A 4 CA  ASP 2
    END

    =output wanted

    1 MET HB2 35
    1 MET CA  22
    2 ASP HA  20
    2 ASP CA  30

    =cut

    print "$1 $5 $4 $3\n" while /^(\S+)\s+(\w)\s+(\S+)(?=.*\n\n.*^\S+\s+\S+\s+(\2..)\s+(\S+)\s+\1)/gms;

    :)

Re: Match two files using regex
by GotToBTru (Prior) on Jun 02, 2015 at 19:21 UTC

    You declare my $value inside the loop, so that variable will cease to exist once the loop exits. You need to move your test inside the loop, and don't re-declare the variable. That's the first problem I see.
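
    The scoping point can be sketched as follows. This is only a minimal illustration, not the original poster's finished script: the in-memory @shifts lines and the first_letter helper are assumptions made for the example, and the fields are assumed to be whitespace-separated. Two related details are folded in: Perl requires braces around the body of an if, and a match assigned to a plain scalar stores the success flag of the match rather than the captured letter, so the capture is taken in list context here.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper (not in the original script): return the first
    # letter found in a field, or undef if there is none.
    sub first_letter {
        my ($field) = @_;
        # List context: the parentheses around $letter make the match
        # return its capture instead of a true/false flag.
        my ($letter) = $field =~ /([A-Za-z])/;
        return $letter;
    }

    # Stand-in for the lines of file1.txt (assumed whitespace-separated).
    my @shifts = ("1 H 35", "1 C 22");

    for my $line (@shifts) {
        my @columns = split ' ', $line;
        my $value   = first_letter($columns[1]);

        # @columns and $value are lexical to this block, so the test and
        # the print have to happen here, before the block ends.
        if (defined $value) {
            print "$columns[0]\t$value\t$columns[2]\n";
        }
    }
    ```

    Anything that needs @columns or $value after the loop would have to be declared before the loop and assigned inside it, without a second my.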

    Dum Spiro Spero

      I see, I guess it's the same for the fields variable as well...

        Yep. I strongly suggest you get familiar with the Perl debugger. It will be an enormous help to you as you learn the language. You can inspect the values of variables while the program is running. I frequently use it to try out syntax, especially when dealing with references.

        Dum Spiro Spero
Re: Match two files using regex
by Anonymous Monk on Jun 03, 2015 at 05:23 UTC

    Did you forget to mention that the first column of the first file also has to match the last column of the second file?

      It should, but that was too complicated for me to write in a script.

        But it's a requirement...

        The one-liner above does it. Here's an expanded version of my one-liner that I hope makes it easier to understand.

        #!/usr/bin/perl

        # http://perlmonks.org/?node_id=1128822

        use strict;
        use warnings;

        $_ = <<END; # input
        1 H 35
        1 C 22
        2 H 20
        2 C 30

        A 1 HB2 MET 1
        A 2 CA  MET 1
        A 3 HA  ASP 2
        A 4 CA  ASP 2
        END

        =output wanted

        1 MET HB2 35
        1 MET CA  22
        2 ASP HA  20
        2 ASP CA  30

        =cut

        #print "$1 $5 $4 $3\n" while /^(\S+)\s+(\w)\s+(\S+)(?=.*\n\n.*^\S+\s+\S+\s+(\2..)\s+(\S+)\s+\1\b)/gms;

        # expanded for clarity

        print "$1 $5 $4 $3\n" while / # match
          ^        # starting at the start of a line
          (\S+)    # capture first field
          \s+      # skip whitespace
          (\w)     # capture letter in column 2
          \s+      # skip whitespace
          (\S+)    # capture third field
          (?=      # zerowidth positive lookahead
            .*     # skipping to
            \n\n   # the empty line separating first and second file
                   # this guarantees the patterns above this are in the first file
                   # and the patterns below are in the second file
            .*     # skipping to
            ^      # start of a line in second file
            \S+    # skip first field (not needed)
            \s+    # skip whitespace
            \S+    # skip second field (not needed)
            \s+    # skip whitespace
            (\2..) # capture third field if it starts with previously captured letter (three wide)
            \s+    # skip whitespace
            (\S+)  # capture fourth field
            \s+    # skip whitespace
            \1     # make sure fifth field matches first field of first file.
            \b     # insure complete match
          )        # end of zerowidth lookahead
          /gmsx; # global, match any start of line, . matches \n, expanded

        __END__
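
        For comparison, the same two-part requirement (first column of the first file equals the last column of the second file, and the letter in column two equals the first letter of the atom name) can also be sketched without a regex, using a hash keyed on both values. This is an illustrative sketch, not code from the thread: the join_shifts helper and the in-memory arrays standing in for the two files are assumptions, and fields are assumed to be whitespace-separated.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical helper: join shift lines to PDB lines on
        # (residue number, first letter of atom name).
        sub join_shifts {
            my ($shift_lines, $pdb_lines) = @_;

            # Index the second file by "residue number:first letter of atom".
            my %pdb;
            for my $line (@$pdb_lines) {
                my (undef, undef, $atom, $residue, $resnum) = split ' ', $line;
                next unless defined $resnum;    # skip blank or short lines
                my $key = join ':', $resnum, substr($atom, 0, 1);
                $pdb{$key} = [$residue, $atom];
            }

            my @out;
            for my $line (@$shift_lines) {
                my ($resnum, $letter, $shift) = split ' ', $line;
                next unless defined $shift;     # skip blank or short lines
                my $hit = $pdb{"$resnum:$letter"} or next;  # no partner: skip
                push @out, join ' ', $resnum, @$hit, $shift;
            }
            return @out;
        }

        # Sample data from the question, held in memory for the example.
        my @file1 = ("1 H 35", "1 C 22", "2 H 20", "2 C 30");
        my @file2 = ("A 1 HB2 MET 1", "A 2 CA MET 1",
                     "A 3 HA ASP 2",  "A 4 CA ASP 2");

        print "$_\n" for join_shifts(\@file1, \@file2);
        ```

        One design caveat: if a residue has two atoms whose names start with the same letter, the later line overwrites the earlier one in the hash. The sample data has no such case, so how that should be resolved is left open here.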