comment on

Dear monks,
I have a set of data that I'm trying to parse and print to a file with tags. Here is an example output:

QURY: Contig103
SIZE: 264
GINO: 116055070
STID: unnamed protein product [Ostreococc...   
SCOR: 40   
EVAL: 0.003

QURY: Contig228
SIZE: 116
GINO: 162464429
NPNO: NP_001104917.1
STID: eucaryotic initiation factor7 [...   
SCOR: 34   
EVAL: 0.004

QURY: Contig1032
SIZE: 217
GINO: 168018663
NPNO: XP_001761865.1
STID: predicted protein [Physcomitrel...   
SCOR: 34 
EVAL: 2e-004
[download]

I've written a code( possibly a very inefficient one) to extract the information I want from each blast result. If there is a hit with a reference number I would like that line to be selected when the result has multiple blast hits. My script doesn't get the hit with the ref number line when it's not the first hit. For eg: The third record should have the NPNO: XP_001761865.1, but instead it gives:

QURY: Heritiera_allseq.fna.Contig1032
SIZE: 217
GINO: 157349006</br>
STID: unnamed protein product [Vitis vini... 
SCOR: 59   
EVAL: 6e-015
[download]

The script should also write the reference number to another file. This is the code I've written:

#!/usr/bin/perl   
# the infile "query.txt" has multiple blast hits per each query
# the outfile needs to have the first blast hit with the Ref number or
+ the best hit
# the second outfile "reflist.txt" is to get the ref numbers to do bat
+ch entrez
$infile = "query.txt"; #output of blast
open (IN, $infile) or die "can't open file: $!";

$outfile = "besthit.txt";
open (OUT, ">$outfile") or die "can't open file: $!";

$ofile = "reflist.txt";
open (NP, ">$ofile") or die "can't open file: $!";

while (<IN>) {
    if ($_ =~ /^Query/) {
        chomp;
        $query = $_;
        $query =~ s/^Query=\s//;
        
    }
    
    elsif ($_ =~ s/^\s+\((\d+)\sletters\)/$1/) {
        chomp;
        $length = $_;
        
    }
    
    elsif ($_ =~ /\|ref\|/) {
        chomp;
        $np = $_;
        until (/^\s+/) {
        $_ = <IN>;
        $line = join (" ", $line, $_);
        }
        # print OUT "$np\n";
        $id = substr $np, 0, 70;
        $score = substr $np, 70, 6;
        $score =~ s/^\s//;
        $evalue = substr $np, 76, 6;
        @id = split(/\|/, $id);
        $gino = $id[1];
        $npno = $id[3];
        $stid = $id[4];
        $stid =~ s/^\s//;
        print NP "$npno\n";
        print OUT "\nQURY: $query\n";
        print OUT "SIZE: $length\n";
        print OUT "GINO: $gino\n";
        print OUT "NPNO: $npno\n";
        print OUT "STID: $stid\n";
        print OUT "SCOR: $score\n";
        print OUT "EVAL: $evalue\n";
    } #elsif end

    elsif ($_ =~ /^gi/) {
        chomp;
        $np = $_;
        until (/^\s+/) {
        $_ = <IN>;
        $line = join (" ", $line, $_);
        }
        # print OUT "$np\n";
        $id = substr $np, 0, 70;
        $score = substr $np, 70, 6;
        $score =~ s/^\s//;
        $evalue = substr $np, 76, 6;
        @id = split(/\|/, $id);
        $gino = $id[1];
        $stid = $id[4];
        $stid =~ s/^\s//;
        print OUT "\nQURY: $query\n";
        print OUT "SIZE: $length\n";
        print OUT "GINO: $gino\n";
        print OUT "STID: $stid\n";
        print OUT "SCOR: $score\n";
        print OUT "EVAL: $evalue\n";
    } #elsif end
            
} #while end
close IN;
close OUT;
close NP;

</readmore>
[download]

This is an example of the inputfile:

Query= Contig103
         (264 letters)
                                                           Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|116055070|emb|CAL57466.1| unnamed protein product [Ostreococc...   
+ 40   0.003

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
S1: 41 (20.4 bits)
S2: 89 (38.9 bits)

Query= Contig228
         (116 letters)

Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|162464429|ref|NP_001104917.1| eucaryotic initiation factor7 [...   
+ 34   0.004
gi|164449261|gb|ABY56090.1| eukaryotic initiation factor iso4E [...   
+ 40   0.006
gi|170753|gb|AAA34296.1| initiation factor (iso)4F p28 subunit        
+ 33   0.007
gi|547713|sp|Q03389|IF4E2_WHEAT Eukaryotic translation initiatio...   
+ 33   0.007

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
  
S1: 41 (20.4 bits)
S2: 89 (38.9 bits)

Query= Contig1032
         (217 letters)

Score    E
Sequences producing significant alignments:                      (bits
+) Value

gi|157349006|emb|CAO24084.1| unnamed protein product [Vitis vini...   
+ 59   6e-015
gi|168018663|ref|XP_001761865.1| predicted protein [Physcomitrel...   
+ 34   2e-004
gi|168018829|ref|XP_001761948.1| predicted protein [Physcomitrel...   
+ 34   2e-004

  Database: All non-redundant GenBank CDS
  translations+PDB+SwissProt+PIR+PRF excluding environmental samples
[download]

Any help/suggestions to modify my code to get the ref number even when it's not the first hit is much appreciated.
Thanks.

In reply to Find the first occurance and if not found print the first line by sm2004

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.