Find the first occurance and if not found print the first line

sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
I have a set of data that I'm trying to parse and print to a file with tags. Here is an example output:

QURY: Contig103
SIZE: 264
GINO: 116055070
STID: unnamed protein product [Ostreococc...   
SCOR: 40   
EVAL: 0.003

QURY: Contig228
SIZE: 116
GINO: 162464429
NPNO: NP_001104917.1
STID: eucaryotic initiation factor7 [...   
SCOR: 34   
EVAL: 0.004

QURY: Contig1032
SIZE: 217
GINO: 168018663
NPNO: XP_001761865.1
STID: predicted protein [Physcomitrel...   
SCOR: 34 
EVAL: 2e-004
[download]

I've written a code( possibly a very inefficient one) to extract the information I want from each blast result. If there is a hit with a reference number I would like that line to be selected when the result has multiple blast hits. My script doesn't get the hit with the ref number line when it's not the first hit. For eg: The third record should have the NPNO: XP_001761865.1, but instead it gives:

QURY: Heritiera_allseq.fna.Contig1032
SIZE: 217
GINO: 157349006</br>
STID: unnamed protein product [Vitis vini... 
SCOR: 59   
EVAL: 6e-015
[download]

The script should also write the reference number to another file. This is the code I've written:

#!/usr/bin/perl   
# the infile "query.txt" has multiple blast hits per each query
# the outfile needs to have the first blast hit with the Ref number or
+ the best hit
# the second outfile "reflist.txt" is to get the ref numbers to do bat
+ch entrez
$infile = "query.txt"; #output of blast
open (IN, $infile) or die "can't open file: $!";

$outfile = "besthit.txt";
open (OUT, ">$outfile") or die "can't open file: $!";

$ofile = "reflist.txt";
open (NP, ">$ofile") or die "can't open file: $!";

while (<IN>) {
    if ($_ =~ /^Query/) {
        chomp;
        $query = $_;
        $query =~ s/^Query=\s//;
        
    }
    
    elsif ($_ =~ s/^\s+\((\d+)\sletters\)/$1/) {
        chomp;
        $length = $_;
        
    }
    
    elsif ($_ =~ /\|ref\|/) {
        chomp;
        $np = $_;
        until (/^\s+/) {
        $_ = <IN>;
        $line = join (" ", $line, $_);
        }
        # print OUT "$np\n";
        $id = substr $np, 0, 70;
        $score = substr $np, 70, 6;
        $score =~ s/^\s//;
        $evalue = substr $np, 76, 6;
        @id = split(/\|/, $id);
        $gino = $id[1];
        $npno = $id[3];
        $stid = $id[4];
        $stid =~ s/^\s//;
        print NP "$npno\n";
        print OUT "\nQURY: $query\n";
        print OUT "SIZE: $length\n";
        print OUT "GINO: $gino\n";
        print OUT "NPNO: $npno\n";
        print OUT "STID: $stid\n";
        print OUT "SCOR: $score\n";
        print OUT "EVAL: $evalue\n";
    } #elsif end

    elsif ($_ =~ /^gi/) {
        chomp;
        $np = $_;
        until (/^\s+/) {
        $_ = <IN>;
        $line = join (" ", $line, $_);
        }
        # print OUT "$np\n";
        $id = substr $np, 0, 70;
        $score = substr $np, 70, 6;
        $score =~ s/^\s//;
        $evalue = substr $np, 76, 6;
        @id = split(/\|/, $id);
        $gino = $id[1];
        $stid = $id[4];
        $stid =~ s/^\s//;
        print OUT "\nQURY: $query\n";
        print OUT "SIZE: $length\n";
        print OUT "GINO: $gino\n";
        print OUT "STID: $stid\n";
        print OUT "SCOR: $score\n";
        print OUT "EVAL: $evalue\n";
    } #elsif end
            
} #while end
close IN;
close OUT;
close NP;

</readmore>
[download]

This is an example of the inputfile:

Query= Contig103
         (264 letters)
                                                           Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|116055070|emb|CAL57466.1| unnamed protein product [Ostreococc...   
+ 40   0.003

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
S1: 41 (20.4 bits)
S2: 89 (38.9 bits)

Query= Contig228
         (116 letters)

Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|162464429|ref|NP_001104917.1| eucaryotic initiation factor7 [...   
+ 34   0.004
gi|164449261|gb|ABY56090.1| eukaryotic initiation factor iso4E [...   
+ 40   0.006
gi|170753|gb|AAA34296.1| initiation factor (iso)4F p28 subunit        
+ 33   0.007
gi|547713|sp|Q03389|IF4E2_WHEAT Eukaryotic translation initiatio...   
+ 33   0.007

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
  
S1: 41 (20.4 bits)
S2: 89 (38.9 bits)

Query= Contig1032
         (217 letters)

Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|157349006|emb|CAO24084.1| unnamed protein product [Vitis vini...   
+ 59   6e-015
gi|168018663|ref|XP_001761865.1| predicted protein [Physcomitrel...   
+ 34   2e-004
gi|168018829|ref|XP_001761948.1| predicted protein [Physcomitrel...   
+ 34   2e-004

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
[download]

Any help/suggestions to modify my code to get the ref number even when it's not the first hit is much appreciated.
Thanks.

Comment on Find the first occurance and if not found print the first line Select or Download Code

Replies are listed 'Best First'.
Re: Find the first occurance and if not found print the first line by pc88mxer (Vicar) on Jun 26, 2008 at 00:42 UTC
As I understand what you are doing, the problem with your approach is that in the third Query there are three `gi` lines but the `ref` is in the second `gi` line. Your logic to determine if a sequence of `gi` lines has a `ref` in it is faulty because it will only check the first `gi` line (once you detect a `gi` line you gobble up the rest of them.) Here's an approach which sub-divides the problem into parsing a Query and then processing it: `use Data::Dumper; my $query; while (<IN>) { if (s/^Query=\s//) { my $save = $_; process($query); } $query = { query_line => $save }; } elsif (m/(\d+)\sletters/) { $query->{letters} = $1; } elsif (m/^gi/) { push(@{$query->{gi_lines}}, $_); } } process($query); sub process { print Dumper($_[0]); # for now }` [download] Now you can examine all of the `gi` lines in a query: `sub process { my $query = shift; return unless $query; # first time $query will be undef ... # iterate through all the gi lines for my $gi_line (@{$query->{gi_lines}}) { if ($gi_line =~ m{\\|ref\\|}) { ... ref found ... } } }` [download] This way it should be easier to implement whatever processing logic you need.	[reply] [d/l] [select]
Re: Find the first occurance and if not found print the first line by jethro (Monsignor) on Jun 26, 2008 at 01:14 UTC
If a result has more than one blast hits, these blast hits have the same SIZE, right? In that case you could collect your blast hits in a hash with SIZE as the key. Now an existing hash entry should only be substituted with a blast hit of the same SIZE when the blast hit has an NPNO value. This assumes that there is only one blast hit with an NPNO in any result, but you seem to imply that As values of the hash you could either store an array with the other values or one string with all the lines of a blast hit. Naturally such a hash might fill up your memory when you have huge piles of data. In that case that hash should be stored on disk with something like DBM::Deep. Another possibility would be to write your extracted data as single lines to a second file with the SIZE at the beginning of the line. Then sort the file (on unix as easy as executing the utility 'sort'). Afterwards read the file and select from consecutive lines with the same SIZE the one with a NPNO in it. Here some code to illustrate what I mean (in case you are not used to using hashes) First how to store NPNO-hits: `$results{$length}= "\nQURY: $query\nSIZE: $length\nGINO: $gino\n ...";` [download] And how to store non-NPNO hits: `if ( not exists $result{$length} ) { $results{$length}= "\nQURY: $query\nSIZE: $length\nGINO: $gino\n ... +"; }` [download] After having read it all in, print out the hash: `foreach my $text (values %results) { print OUT $text }` [download] EDIT: Didn't notice that all blast hits of a result are grouped together. My solution still works but the solution of pc88mxer is much better.	[reply] [d/l] [select]
Re: Find the first occurance and if not found print the first line by Tanktalus (Canon) on Jun 26, 2008 at 16:13 UTC
You did evaluate the blast modules on CPAN, right? I'm guessing that one or more of those would already be able to parse the blast records into something you can use easily in perl to generate whatever you need. I'm sure it'd require some major restructuring of your code, but it'd also remove a large portion of the code making it easier to read, follow, and understand six months from now.	[reply]
Re^2: Find the first occurance and if not found print the first line by sm2004 (Acolyte) on Jun 27, 2008 at 21:12 UTC
Thank you all for help with the code and other suggestions. I haven't used CPAN, but will look into it too. Thanks again.	[reply]