sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
I have a set of data that I'm trying to parse and print to a file with tags. Here is an example output:
QURY: Contig103 SIZE: 264 GINO: 116055070 STID: unnamed protein product [Ostreococc... SCOR: 40 EVAL: 0.003 QURY: Contig228 SIZE: 116 GINO: 162464429 NPNO: NP_001104917.1 STID: eucaryotic initiation factor7 [... SCOR: 34 EVAL: 0.004 QURY: Contig1032 SIZE: 217 GINO: 168018663 NPNO: XP_001761865.1 STID: predicted protein [Physcomitrel... SCOR: 34 EVAL: 2e-004
I've written a code( possibly a very inefficient one) to extract the information I want from each blast result. If there is a hit with a reference number I would like that line to be selected when the result has multiple blast hits. My script doesn't get the hit with the ref number line when it's not the first hit. For eg: The third record should have the NPNO: XP_001761865.1, but instead it gives:
QURY: Heritiera_allseq.fna.Contig1032 SIZE: 217 GINO: 157349006</br> STID: unnamed protein product [Vitis vini... SCOR: 59 EVAL: 6e-015
The script should also write the reference number to another file. This is the code I've written:
#!/usr/bin/perl # the infile "query.txt" has multiple blast hits per each query # the outfile needs to have the first blast hit with the Ref number or + the best hit # the second outfile "reflist.txt" is to get the ref numbers to do bat +ch entrez $infile = "query.txt"; #output of blast open (IN, $infile) or die "can't open file: $!"; $outfile = "besthit.txt"; open (OUT, ">$outfile") or die "can't open file: $!"; $ofile = "reflist.txt"; open (NP, ">$ofile") or die "can't open file: $!"; while (<IN>) { if ($_ =~ /^Query/) { chomp; $query = $_; $query =~ s/^Query=\s//; } elsif ($_ =~ s/^\s+\((\d+)\sletters\)/$1/) { chomp; $length = $_; } elsif ($_ =~ /\|ref\|/) { chomp; $np = $_; until (/^\s+/) { $_ = <IN>; $line = join (" ", $line, $_); } # print OUT "$np\n"; $id = substr $np, 0, 70; $score = substr $np, 70, 6; $score =~ s/^\s//; $evalue = substr $np, 76, 6; @id = split(/\|/, $id); $gino = $id[1]; $npno = $id[3]; $stid = $id[4]; $stid =~ s/^\s//; print NP "$npno\n"; print OUT "\nQURY: $query\n"; print OUT "SIZE: $length\n"; print OUT "GINO: $gino\n"; print OUT "NPNO: $npno\n"; print OUT "STID: $stid\n"; print OUT "SCOR: $score\n"; print OUT "EVAL: $evalue\n"; } #elsif end elsif ($_ =~ /^gi/) { chomp; $np = $_; until (/^\s+/) { $_ = <IN>; $line = join (" ", $line, $_); } # print OUT "$np\n"; $id = substr $np, 0, 70; $score = substr $np, 70, 6; $score =~ s/^\s//; $evalue = substr $np, 76, 6; @id = split(/\|/, $id); $gino = $id[1]; $stid = $id[4]; $stid =~ s/^\s//; print OUT "\nQURY: $query\n"; print OUT "SIZE: $length\n"; print OUT "GINO: $gino\n"; print OUT "STID: $stid\n"; print OUT "SCOR: $score\n"; print OUT "EVAL: $evalue\n"; } #elsif end } #while end close IN; close OUT; close NP; </readmore>
This is an example of the inputfile:
Query= Contig103 (264 letters) Score E Sequences producing significant alignments: (bits +) Value gi|116055070|emb|CAL57466.1| unnamed protein product [Ostreococc... + 40 0.003 Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples S1: 41 (20.4 bits) S2: 89 (38.9 bits) Query= Contig228 (116 letters) Score E Sequences producing significant alignments: (bits +) Value gi|162464429|ref|NP_001104917.1| eucaryotic initiation factor7 [... + 34 0.004 gi|164449261|gb|ABY56090.1| eukaryotic initiation factor iso4E [... + 40 0.006 gi|170753|gb|AAA34296.1| initiation factor (iso)4F p28 subunit + 33 0.007 gi|547713|sp|Q03389|IF4E2_WHEAT Eukaryotic translation initiatio... + 33 0.007 Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples S1: 41 (20.4 bits) S2: 89 (38.9 bits) Query= Contig1032 (217 letters) Score E Sequences producing significant alignments: (bits +) Value gi|157349006|emb|CAO24084.1| unnamed protein product [Vitis vini... + 59 6e-015 gi|168018663|ref|XP_001761865.1| predicted protein [Physcomitrel... + 34 2e-004 gi|168018829|ref|XP_001761948.1| predicted protein [Physcomitrel... + 34 2e-004 Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples
Any help/suggestions to modify my code to get the ref number even when it's not the first hit is much appreciated.
Thanks.

Replies are listed 'Best First'.
Re: Find the first occurance and if not found print the first line
by pc88mxer (Vicar) on Jun 26, 2008 at 00:42 UTC
    As I understand what you are doing, the problem with your approach is that in the third Query there are three gi lines but the ref is in the second gi line. Your logic to determine if a sequence of gi lines has a ref in it is faulty because it will only check the first gi line (once you detect a gi line you gobble up the rest of them.)

    Here's an approach which sub-divides the problem into parsing a Query and then processing it:

    use Data::Dumper; my $query; while (<IN>) { if (s/^Query=\s*//) { my $save = $_; process($query); } $query = { query_line => $save }; } elsif (m/(\d+)\s*letters/) { $query->{letters} = $1; } elsif (m/^gi/) { push(@{$query->{gi_lines}}, $_); } } process($query); sub process { print Dumper($_[0]); # for now }
    Now you can examine all of the gi lines in a query:
    sub process { my $query = shift; return unless $query; # first time $query will be undef ... # iterate through all the gi lines for my $gi_line (@{$query->{gi_lines}}) { if ($gi_line =~ m{\|ref\|}) { ... ref found ... } } }
    This way it should be easier to implement whatever processing logic you need.
Re: Find the first occurance and if not found print the first line
by jethro (Monsignor) on Jun 26, 2008 at 01:14 UTC
    If a result has more than one blast hits, these blast hits have the same SIZE, right?

    In that case you could collect your blast hits in a hash with SIZE as the key. Now an existing hash entry should only be substituted with a blast hit of the same SIZE when the blast hit has an NPNO value. This assumes that there is only one blast hit with an NPNO in any result, but you seem to imply that

    As values of the hash you could either store an array with the other values or one string with all the lines of a blast hit.

    Naturally such a hash might fill up your memory when you have huge piles of data. In that case that hash should be stored on disk with something like DBM::Deep. Another possibility would be to write your extracted data as single lines to a second file with the SIZE at the beginning of the line. Then sort the file (on unix as easy as executing the utility 'sort'). Afterwards read the file and select from consecutive lines with the same SIZE the one with a NPNO in it.

    Here some code to illustrate what I mean (in case you are not used to using hashes)

    First how to store NPNO-hits:

    $results{$length}= "\nQURY: $query\nSIZE: $length\nGINO: $gino\n ...";
    And how to store non-NPNO hits:
    if ( not exists $result{$length} ) { $results{$length}= "\nQURY: $query\nSIZE: $length\nGINO: $gino\n ... +"; }
    After having read it all in, print out the hash:
    foreach my $text (values %results) { print OUT $text }
    EDIT: Didn't notice that all blast hits of a result are grouped together. My solution still works but the solution of pc88mxer is much better.
Re: Find the first occurance and if not found print the first line
by Tanktalus (Canon) on Jun 26, 2008 at 16:13 UTC

    You did evaluate the blast modules on CPAN, right? I'm guessing that one or more of those would already be able to parse the blast records into something you can use easily in perl to generate whatever you need. I'm sure it'd require some major restructuring of your code, but it'd also remove a large portion of the code making it easier to read, follow, and understand six months from now.

      Thank you all for help with the code and other suggestions. I haven't used CPAN, but will look into it too. Thanks again.