How to do a reciprocal matching statement

ajl412860 has asked for the wisdom of the Perl Monks concerning the following question:

This is part of project I am working on. Here's the main paradigm I am putting Genomes through a blast program, and doing a reciprocal blast hit on the same genomes. So if I do, Genome A against Genome B, then I am doing Genome B to Genome A. Basically I am seeing if the the First Genome A, matches the second Genome A. Example: A.B.blast B.A.blast Seeing wether the A's are identical. If they are identical, the both files names get recorded on a txt file. Now I got the script working, but it either keeps stagnating on a single file within the dual while loops. I have to do this with about 50 other files. Its been really tricky to get this thing going. All that really needs to happen is a simple comparison. Any advice would be greatly appreciated. If their is a simplier way of doing this, then I am all for it. Its just that I am limited to just using Using strict and warnings, so I cant use anything else. I am using Perl 5

Update; here are the inputs from the file,

Genome A to Genome B contents include.

gi|110123922|gb|EC817325.1|EC817325 gi|110095377|gb|EC788780.1|EC788780

gi|110123921|gb|EC817324.1|EC817324 gi|110105430|gb|EC798833.1|EC798833 6

gi|110123920|gb|EC817323.1|EC817323 gi|110106464|gb|EC799867.1|EC799867

In this file the first gi number in the line, represents genome A, and the second gi number in the line represents genome .B

gi|110104773|gb|EC798176.1|EC798176 gi|110119622|gb|EC813025.1|EC813025

In this file the first gi number represents Genome B, and the second gi number represents Genome A.

This is where it really gets confusing

#!/usr/bin/env perl
use strict;
use warnings;

my @a;
my $n = "\n";
@a = glob("*.Recip.blast.top");
my $t = "\t";
foreach my $a (@a){
#  print $a."\n";
  my @b = split(/[.]/,$a);
  #print @b.$n;
  #print $b[0].$n.$b[1].$n.$b[2].$n.$b[3].$n.$b[4].$n.$b[5].$n.$b[6].$
+n.$b[7];
  my $OrginalBlast;
  $OrginalBlast = $b[0].".".$b[1].".".$b[2].".".$b[3].".".$b[4].".".$b
+[5].".".$b[6].".".$b[7];
#  print $OrginalBlast;
#  print $n;

#####ORGN is = AB|||||||


  open(ORGN,"/home/ajl12013/Labwork/Dbfiles/results/$OrginalBlast") ||
+ die;

######RECI is = BA|||||||

  open(RECI,$a) || die;
  open(OUTM,">MatchingGI.txt" ) || die;
  open(OUTNoM,">NoMatchingGI.txt") || die;
  my $ORGN = <ORGN>;
  #The scalar format works here, but for some reason, the <RECI> doesn
+t print, this makes debugging the whole process very difficult and te
+dious.
  my $RECI = <RECI>;
  #print <ORGN>;
  #print "############################################################
+##################################";
  #print $RECI;
  #print <RECI>;
  my %GenomeTable;
my $Genome1A ;
  my $G1A;
  my $data;
  my $G2A;
  my $Genome2A;
  my $l;
  local $/;
  my $key;
$GenomeTable{$OrginalBlast} = $a ;

#print keys %GenomeTable;
#print $n;
#print values %GenomeTable;
#print $n;
#### First while statement for the AB file.
$GenomeTable{$n} = $n;

while(my $G1A = <ORGN>){
  # my @G1A = $G1A =~ m/^[gi]\d[|]$/g;
  #print $G1A;
  #print $G1A[0];
  #foreach my $q (@G1A){
  #  print $q
  #chomp($G1A);
  if ($G1A =~ m/^[gi|]\w/){
  ($Genome1A) = $G1A =~ m/^gi\|\d+/g;
  $GenomeTable{$Genome1A}  = [];
  $GenomeTable{$n};
  #print "These are the keys: ";
  #print  keys %GenomeTable ;
  #print $n;
#my $GenomeAB_ref;
#$GenomeAB_ref->$GenomeTable{$Genome1A};
#print $GenomeAB_ref;
  foreach my $keys (keys %GenomeTable){ print " Keys: $keys $n";}
}
 while($G2A = $RECI ){
    #print $G2A;
    #chomp($G2A);
    if($G2A =~ m/^[gi|]/){
      # print "Here";
       $G2A =~ m/^gi\|\w+\|\w+\|\w+\.\d\|\w+\s(gi\|\d+)/g;
      $Genome2A = $1;
       # ($l) = $G2A =~ m/^gi\|\d+/g;
     #$G2A =~ m/^gi\|\w+\|\w+\|\w+\.\d\|\w+\s(gi\|\d+)/g;
     #print $l;
     #$Genome2A = $1;
       #print $Genome2A;
    $GenomeTable{$Genome1A} = $Genome2A;
    #print "These are the valuesi: ";
    #print  values %GenomeTable   ;
    #print $n;
  foreach my $values (values %GenomeTable){print "Values: $values $n";
+}


  $Genome1A eq $Genome2A ? print OUTM $a.$n : print OUTNoM $a.$n;

  #exit;
}
#exit;
}
#exit;
}
}
[download]

Comment on How to do a reciprocal matching statement Download Code

Replies are listed 'Best First'.
Re: How to do a reciprocal matching statement by graff (Chancellor) on Sep 08, 2015 at 01:44 UTC
What do you think your inner while loop is supposed to do? How many iterations do you think there should be of that loop? Here's how it starts: `while($G2A = $RECI ){` [download] $RECI is a scalar variable containing a string ~~that is all the data~~ read from the file whose name ($a) is the current iterator on the outermost "foreach" loop. The while statement quoted above simply assigns that same string value to $G2A. There is no "break" or "last" statement inside the loop, and nothing to alter the value of $RECI, so if the string is not empty, the while condition will always be true, and you'll be in an infinite loop. (Did you intend to use "==" instead of "="?) I second the advice given above: if you still have a problem after fixing that inner while loop, show us a little input data and what the output is supposed to look like for that data. UPDATE: (Sep. 9): I just happened to take another look at the code (having the perltidy version below really helps!), and I realized a couple more things about your handling of input files: You do "local $/" after an initial read from both ORGN and RECI file handles, so $ORGN and $RECI contain only the initial line of each file (contrary to what I struck out above) You enter your first while loop, `while (my $G1A = <ORGN>)` after doing "local $/", so there will be only one iteration of that loop. I'm guessing that you may have intended the second while loop to start with `while($G2A=<RECI>)` -- but this would likewise have only one iteration, since the input is being read in "slurp" mode.	[reply] [d/l] [select]
Re^2: How to do a reciprocal matching statement by ajl412860 (Novice) on Sep 09, 2015 at 19:51 UTC
To address you post, the reason I am using two different while loops is because this is the only way from what I know, will compare both files that I am looking at. The nested while loops was the only thing that was working. If there is another way of comparing to files in perl, I am open to changing the script to do so. The amount of iterations equates to the amount of data within each file, meaning that once every line of each file has been compared to each other, with the appropriate matching statements, then both while loops should stop, and another iteration of the $a from the list should be implicated, thus starting the whole process all over again. The file names from glob function, have a purpose. The gob files contain the names of genome AB files with in the name of the genome BA file name. So when I do the split function on the $a file I will get both file names. Also to explain why I did a local $/, without it, the whole content of the files wouldn't be upload into the scripts. At one point only the first lines were being compared, so I used the local $/ so that the scalars could have all the content of the files being upload.	[reply]
Re^3: How to do a reciprocal matching statement by graff (Chancellor) on Sep 10, 2015 at 00:20 UTC
Regarding your reply, it seems I should have been more clear about the problems in your code: As written in the OP, your code either never enters the second while loop (because you've read an empty string from an input file) or else it goes into an infinite loop (because if `$G2A = $RECI` returns true, nothing happens to allow exiting the loop). As written, your handing of inputs from data files appears to be misguided; it's very unlikely that it can work the way you intend it to work. Based on your updates to the OP, and your replies to me and other monks, it seems like you don't really have a clear picture of an algorithm that will produce the results you want to get, and so far, we have not gotten from you a clear picture of the inputs and intended outputs. I hope you understand what I'm saying. If not, you can ask for clarification. But more importantly, see if you can provide a fairly simple, direct description of the task you are trying to accomplish: "(A) There are (how many?) different types of input files - here are brief examples of the content for each type: ... (B) Given these inputs, there's this particular type of match that I need to locate: ... (C) Each time I locate this sort of match, there's this particular information that needs to be output: ...." -- or something to that effect. If you can't figure out how to express your task in a simple example, it's much harder for us to help you. But if you can do that, it should help you to get a fresh start on your script. (The OP script is unlikely to be a good starting point.)	[reply] [d/l]
Re^3: How to do a reciprocal matching statement by 1nickt (Canon) on Sep 09, 2015 at 21:46 UTC
You might like to read How do i extract only contigs of interest, from today. To quote choroba from that thread: `Searching for information taken from one file in another file is quite + a common task. The usual solution is to hash the identifiers, then i +terate over the second file and check the hash for the existence of t +he identifier.` [download] The way forward always starts with a minimal test.	[reply] [d/l]
Re: How to do a reciprocal matching statement by 1nickt (Canon) on Sep 08, 2015 at 01:17 UTC
Hi ajl412860, please post a small sample of your input data and what you expect as output. Post the samples inside `<code></code>`tags. You can edit your original post. Here's a couple of things from one section of your program that you could address in the meantime: `foreach my $a (@a){ # print $a."\n"; my @b = split(/[.]/,$a); #print @b.$n; #print $b[0].$n.$b[1].$n.$b[2].$n.$b[3].$n.$b[4].$n.$b[5].$n.$b[6].$ +n.$b[7]; my $OrginalBlast; $OrginalBlast = $b[0].".".$b[1].".".$b[2].".".$b[3].".".$b[4].".".$b +[5].".".$b[6].".".$b +[7];` [download] Give your variables meaningful names. It's false economy to use one-letter variable names; much better to name the variable so that it tells you what it is when you look at it in your code. Also, you shouldn't use `$a` or `$b`, as they are special reserved variable names. You can just declare a variable the first time you use it. No need to do: `my $foo; $foo = 'bar';` [download] . . . just do: `my $foo = 'bar';` [download] If you want to join a bunch of variables together with a certain string in between, use join. As in: `my $str = join '; ', $foo, $bar, $baz, $qux;` [download] You seem to be splitting the line on a dot, then reassembling the parts, again on a dot (for which you should use `join`). That serves no purpose as far as I can see. So you could probably better write the above code as: `foreach my $OriginalBlast (@files) {` [download] . . . I'm afraid as graff points out the rest of your program has similar ineffectual code. If you post your input data and what you want to get out, we'll be able to help you. But you really should read up on some basics, such as opening and reading files, perlintro, perlfaq and so on. You might want to set aside this program for now and spend some time with the most simple program you can write, in a test directory, opening, reading and writing to the most simple files you can create. The way forward always starts with a minimal test.	[reply] [d/l] [select]
Re: How to do a reciprocal matching statement by hippo (Archbishop) on Sep 08, 2015 at 08:28 UTC
I'm constantly amazed that people (try to) work with source code which looks like that. I've done only two things to it: Removed all the commented-out lines which aren't comments with `perl -ni -e 'print unless /^\s#[^#!]/;' 1141288.pl` Run it through perltidy to sort out the indenting The resulting code is here: #!/usr/bin/env perl use strict; use warnings; my @a; my $n = "\n"; @a = glob (".Recip.blast.top"); my $t = "\t"; foreach my $a (@a) { my @b = split (/[.]/, $a); my $OrginalBlast; $OrginalBlast = $b[0] . "." . $b[1] . "." . $b[2] . "." . $b[3] . "." . $b[4] . "." . $b[5] . "." . $b[6] . "." . $b[7]; #####ORGN is = AB\|\|\|\|\|\|\| open (ORGN, "/home/ajl12013/Labwork/Dbfiles/results/$OrginalBlast" +) \|\| die; ######RECI is = BA\|\|\|\|\|\|\| open (RECI, $a) \|\| die; open (OUTM, ">MatchingGI.txt") \|\| die; open (OUTNoM, ">NoMatchingGI.txt") \|\| die; my $ORGN = <ORGN>; my $RECI = <RECI>; my %GenomeTable; my $Genome1A; my $G1A; my $data; my $G2A; my $Genome2A; my $l; local $/; my $key; $GenomeTable{$OrginalBlast} = $a; #### First while statement for the AB file. $GenomeTable{$n} = $n; while (my $G1A = <ORGN>) { if ($G1A =~ m/^[gi\|]\w/) { ($Genome1A) = $G1A =~ m/^gi\\|\d+/g; $GenomeTable{$Genome1A} = []; $GenomeTable{$n}; foreach my $keys (keys %GenomeTable) { print " Keys: $keys + $n"; } } while ($G2A = $RECI) { if ($G2A =~ m/^[gi\|]/) { $G2A =~ m/^gi\\|\w+\\|\w+\\|\w+\.\d\\|\w+\s(gi\\|\d+)/g; $Genome2A = $1; $GenomeTable{$Genome1A} = $Genome2A; foreach my $values (values %GenomeTable) { print "Values: $values $n"; } $Genome1A eq $Genome2A ? print OUTM $a . $n : print OUTNoM $a . $n; } } } } [download] From this the overall structure of the code is much more apparent to me. Now you can think about applying the other excellent suggestions already given by my learned brethren. It would also be useful to know which version of Perl you are using when it comes to actually solving your problem.	[reply] [d/l] [select]
Re: How to do a reciprocal matching statement by GotToBTru (Prior) on Sep 08, 2015 at 13:06 UTC
I am guessing the point of: `my @b = split(/[.]/,$a); my $OrginalBlast; $OrginalBlast = $b[0].".".$b[1].".".$b[2].".".$b[3].".".$b[4].".".$b +[5].".".$b[6].".".$b[7];` [download] is to remove unwanted elements from the end of $a. It must have 9 or more elements separated by periods, and only 8 are wanted. There are better ways to achieve this, among them: `$OriginalBlast = join '.', split(/[.]/,$a)[0..7];` [download] or `($OriginalBlast) = $a =~ m/(([^.]+\.?){8})/;` [download] Dum Spiro Spero	[reply] [d/l] [select]
Re: How to do a reciprocal matching statement by graff (Chancellor) on Sep 10, 2015 at 00:38 UTC
Regarding your update to the OP: please note that you can and should put "code" tags around data samples (and you can abbreviate them to <c> and </c>). Also, you can have two or more separate "code" blocks in one post, like this: Sample data from one type of file: `gi\|110123922\|gb\|EC817325.1\|EC817325 gi\|110095377\|gb\|EC788780.1\|EC78878 +0 gi\|110123921\|gb\|EC817324.1\|EC817324 gi\|110105430\|gb\|EC798833.1\|EC79883 +3 6 gi\|110123920\|gb\|EC817323.1\|EC817323 gi\|110106464\|gb\|EC799867.1\|EC79986 +7` [download] Sample data from another type of file: `gi\|110104773\|gb\|EC798176.1\|EC798176 gi\|110119622\|gb\|EC813025.1\|EC81302 +5` [download] Now, could you please clarify: how many distinct types of input files are there? If more than one, how do you tell them apart? (Are the file names different in some regular way? Is there something different about their format or content?) What sort of matches are you looking for? Now that I've previewed this reply of mine, so I can see your data samples in "code" font, I'm still stuck. You haven't really given a clear enough explanation of what you're trying to do.	[reply] [d/l] [select]
Re: How to do a reciprocal matching statement by 1nickt (Canon) on Sep 10, 2015 at 07:20 UTC
OK, thanks for posting some data. In future, please post your data inside `<code></code>` tags, like code. The following solution uses some lightweight modules to do a lot of the work. Since you are testing scientific results I thought it would be appropriate to use part of Perl's testing framework. Test::Differences compares two data structures to see if they are identical and reports where they differ, if they are not identical. This script writes test results (tests are named for the file they are testing) to a composite log file, and also writes test failure diagnostics (a diff of the two files) to an individual log for each file. (The only thing I don't like is that you get left with zero-byte failure logs if there were no failures). It assumes you want to strip the filenames as shown; change the regexp to suit. It also makes up a directory for the reciprocal files called 'Recip' and for the original blast files called 'Lab' -- change to suit. #!/usr/bin/perl use strict; use warnings; use File::Find::Rule; use Path::Tiny qw/ path /; use Test::More; use Test::Differences; # log of all tests Test::More->builder->output( 'test_results.txt' ); # Get all the files we want to compare my $rule = File::Find::Rule->new; $rule->file->name('.Recip.blast.top'); my @files = $rule->in( 'Recip' ); foreach my $rcp_file ( @files ) { # make a new path for the original (lab results) file and # strip the unwanted string from the end of the filename ( my $org_file = $rcp_file ) =~ s/^Recip/Lab/; $org_file =~ s/.Recip.blast.top//; # designate an individual test failure log ( my $err_log = "test_failure.$org_file.txt" ) =~ s/Lab\///; Test::More->builder->failure_output( $err_log ); # Get the content of the two files, extract the wanted strings # to be compared, and store in arrays my @rcp_lines = path( $rcp_file )->lines({ chomp => 1 }); @rcp_lines = map { join(' ', (split '\\|')[1,5]) } @rcp_lines; my @org_lines = path( $org_file )->lines({ chomp => 1 }); @org_lines = map { join(' ', (split '\\|')[5,1]) } @org_lines; # run the tests eq_or_diff( \@rcp_lines, \@org_lines, $org_file); } done_testing; __END__ [download] My data files for testing: `$ cat Recip/do.re.mi.fa.so.la.ti.1.Recip.blast.top gi\|110123922\|gb\|EC817325.1\|EC817325 gi\|110095377\|gb\|EC788780.1\|EC78878 +0 gi\|110123921\|gb\|EC817324.1\|EC817324 gi\|110105430\|gb\|EC798833.1\|EC79883 +3 6 gi\|110123920\|gb\|EC817323.1\|EC817323 gi\|110106464\|gb\|EC799867.1\|EC79986 +7` [download] `$ cat Recip/do.re.mi.fa.so.la.ti.2.Recip.blast.top gi\|110123922\|gb\|EC817325.1\|EC817325 gi\|110095377\|gb\|EC788780.1\|EC78878 +0 gi\|110123921\|gb\|EC817324.1\|EC817324 gi\|110105430\|gb\|EC798833.1\|EC79883 +3 6 gi\|110123920\|gb\|EC817323.1\|EC817323 gi\|110106464\|gb\|EC799867.1\|EC79986 +7` [download] `$ cat Lab/do.re.mi.fa.so.la.ti.1 gi\|110095377\|gb\|EC788780.1\|EC788780 gi\|110123922\|gb\|EC817325.1\|EC817 +325 gi\|110105430\|gb\|EC798833.1\|EC798833 6 gi\|110123921\|gb\|EC817324.1\|EC817 +324 gi\|110106464\|gb\|EC799867.1\|EC799867 gi\|110123920\|gb\|EC817323.1\|EC817 +323` [download] `$ cat Lab/do.re.mi.fa.so.la.ti.2 gi\|110095377\|gb\|EC788780.1\|EC788780 gi\|110123922\|gb\|EC817325.1\|EC817 +325 gi\|210105430\|gb\|EC798833.1\|EC798833 6 gi\|110123921\|gb\|EC817324.1\|EC817 +324 gi\|110106464\|gb\|EC799867.1\|EC799867 gi\|110123920\|gb\|EC817323.1\|EC817 +323` [download] And the output: $ perl 1141288.pl $ [download] `$ cat test_results.txt ok 1 - Lab/do.re.mi.fa.so.la.ti.1 not ok 2 - Lab/do.re.mi.fa.so.la.ti.2 1..2` [download] $ cat test_failure.do.re.mi.fa.so.la.ti.2.txt # Failed test 'Lab/do.re.mi.fa.so.la.ti.2' # at 1141288.pl line 40. # +----+--------------------------+--------------------------+ # \| Elt\|Got \|Expected \| # +----+--------------------------+--------------------------+ # \| 0\|[ \|[ \| # \| 1\| '110123922 110095377', \| '110123922 110095377', \| # 2\| '110123921 110105430', \| '110123921 210105430', * # \| 3\| '110123920 110106464' \| '110123920 110106464' \| # \| 4\|] \|] \| # +----+--------------------------+--------------------------+ # Looks like you failed 1 test of 2. [download] Hope this helps! The way forward always starts with a minimal test.	[reply] [d/l] [select]