in reply to Re^3: Read, match string and print
in thread Read, match string and print

Thanks once again. Yes, it is working but it is getting the wrong input and hence prints out incorrectly.
/protein_id="NP_12312"
/db_xref="GI:7546536"
Here, in order to get the GI number, I use this matching expression: elsif ( /^\s*protein_id\S*\n\s*\Sdb_xref="GI:(\d+)/ ) but it is not matching. Am I making some fundamental mistake with the matching operators?

Re^5: Read, match string and print
by Corion (Patriarch) on Feb 08, 2010 at 09:32 UTC

    In the other code posted here, I see that you're reading the file line by line. You can't match more than one line if you're reading/processing each line separately. You will need to either set a flag or collect all information up to a point where you know that the current set of data has ended (for example because you hit the start of the next gene or EOF), and then process the accumulated data.
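A minimal sketch of the "set a flag" variant, assuming the /db_xref line always comes directly after the /protein_id line as in the two sample lines above. The helper name gi_after_protein_id is made up for illustration. Note that the pattern also has to match the literal "/" in front of protein_id, which the expression in the question does not allow for.

```perl
use strict;
use warnings;

# Hypothetical helper: scan lines one at a time and return every GI number
# found on the line directly after a /protein_id= line.
sub gi_after_protein_id {
    my @lines = @_;
    my $seen_protein_id = 0;    # the flag: was the previous line a protein_id?
    my @gi;
    for my $line (@lines) {
        if ($line =~ m!^\s*/protein_id=!) {
            $seen_protein_id = 1;
        }
        elsif ($seen_protein_id and $line =~ m!^\s*/db_xref="GI:(\d+)"!) {
            push @gi, $1;
            $seen_protein_id = 0;
        }
        else {
            # Any other line breaks the protein_id / db_xref pairing.
            $seen_protein_id = 0;
        }
    }
    return @gi;
}

# The two sample lines from the question, with assumed leading whitespace.
my @sample = (
    qq{     /protein_id="NP_12312"\n},
    qq{     /db_xref="GI:7546536"\n},
);
print join(',', gi_after_protein_id(@sample)), "\n";    # prints 7546536
```

Each regex only ever sees one line, so no \n is needed in either pattern; the flag carries the "previous line" information across loop iterations.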

      Thanks. You are right. I was trying to jump between lines using \n, while in fact the file is being read line by line. I understand what you suggested, but it is very difficult for me to implement. Can you help me, please?

        It's not that hard. The process basically is:

        my %info; # here we collect all information

        # The name and order of the columns we want to print
        my @columns = qw(gi version cds);

        sub flush_info {
            # print out all information, one record per line:
            print join( '*', @info{@columns} ), "\n";
            # and forget the collected information
            %info = ();
        };

        while (<>) {
            if (m!GI:(\d+)!) {
                if ($info{cds}) {
                    # we are in a CDS block
                    $info{gi} = $1;
                };
            } elsif (m!^\s+CDS\s+(.*)!) {
                # a new gene information has started
                flush_info();
                # now remember the CDS
                $info{cds} = $1
            } elsif (m!^VERSION.*\w:(\d+)! ) {
                $info{ version } = $1;
            } else {
                warn "Ignoring unknown line [$_]\n";
            };
        };

        # Output any leftover information:
        flush_info();
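A variation on the same idea that stays closer to the multi-line regex from the question: accumulate the raw lines of each CDS block into one string, and only apply the \n-spanning pattern once the block is complete. This is only a sketch. The helper names gi_from_block and gi_per_cds are invented here, the second record in the sample is made-up test data, and it assumes /db_xref directly follows /protein_id.

```perl
use strict;
use warnings;

# Hypothetical helper: return the GI number from one accumulated block of
# text, or undef if /protein_id is not directly followed by /db_xref="GI:...".
sub gi_from_block {
    my ($block) = @_;
    # /m lets ^ match at the start of any line inside the block.
    return $block =~ m!^\s*/protein_id=\S*\n\s*/db_xref="GI:(\d+)"!m ? $1 : undef;
}

# Hypothetical helper: split lines into CDS blocks and extract one GI per block.
sub gi_per_cds {
    my @lines = @_;
    my ($block, @gi);
    for my $line (@lines) {
        if ($line =~ m!^\s+CDS\s!) {
            # A new CDS starts: process what has been accumulated so far.
            push @gi, gi_from_block($block) if defined $block;
            $block = '';
        }
        # Lines before the first CDS are skipped.
        $block .= $line if defined $block;
    }
    # Process the last block at end of input.
    push @gi, gi_from_block($block) if defined $block;
    return grep { defined } @gi;
}

my @sample = (
    "     CDS             1..300\n",
    qq{                     /protein_id="NP_12312"\n},
    qq{                     /db_xref="GI:7546536"\n},
    "     CDS             400..900\n",
    qq{                     /protein_id="NP_99999"\n},    # made-up record
    qq{                     /db_xref="GI:1234567"\n},     # made-up record
);
print join('*', gi_per_cds(@sample)), "\n";    # prints 7546536*1234567
```

The pattern works here because it is matched against the whole accumulated block rather than against a single line.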