sophix has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks! I am still having difficulties with the basics: reading, matching and printing. I would appreciate if you could help me figure out the mistakes I have been making in the following code. Thanks!
#!/usr/bin/perl use strict; use warnings; my $cds = ''; my $version = ''; my $gi = ''; my @genbank = (); my $data_file ="/DATA/GenBankFile.gb"; open(INFILE, $data_file) || die("Could not open file!"); @genbank=<INFILE>; close(INFILE); foreach my $line (@genbank) { if( $line =~ /^\/\/\n/ ) { last;} elsif($line =~ /^VERSION/) { $line =~ s/^VERSION\s*\w*\S\d*\s*\w*\S//; + $version = $line;} elsif($line =~ /^\s*\Sdb_xref="GI/) { chomp($line); $line =~ s/^\s*\Sd +b_xref="GI://; $gi = $line;} elsif($line =~ /^\s*CDS/){ chomp($line); $line =~ s/^\s*CDS\s*//; $cds + = $line;} print "$gi"."\t"."$version"."\t"."$cds\n" if (defined $cds and defined + $gi and defined $version); }
and here is the data I have been reading:
LOCUS NC_0000230 600020 bp DNA linear CON 21- +APR-2007 VERSION NC_000023.10 GI:123456789 CDS join(11111..222222,333333..444444) /db_xref="GI:5555555" CDS join(66666..7777777,888888..99999) /db_xref="GI:10101010" //
and here is the far-from-the-desired-output:
123456789 123456789 join(11111..222222,333333..444444) 5555555" 123456789 join(11111..222222,333333..444444) 5555555" 123456789 join(66666..7777777,888888..99999) 10101010" 123456789 join(66666..7777777,888888..99999) 10101010" 123456789 join(66666..7777777,888888..99999)
and here is what the output should look like:
5555555 123456789 join(11111..222222,333333..444444) 10101010 123456789 join(66666..7777777,888888..99999)

Replies are listed 'Best First'.
Re: Read, match string and print
by jwkrahn (Abbot) on Feb 08, 2010 at 07:15 UTC

    Your first mistake is assigning values to $cds, $version and $gi and then using the test defined $cds and defined $gi and defined $version on values that are always defined.    And then your next mistake is not clearing out the values in the variables after you have printed them.

    What you need is something more like this:

    #!/usr/bin/perl use strict; use warnings; my $data_file = '/DATA/GenBankFile.gb'; open INFILE, '<', $data_file or die "Could not open '$data_file' $!"; my ( $cds, $gi, $version ); while ( <INFILE> ) { last if m!//$!; if ( /^VERSION.*\w:(\d+)/ ) { $version = $1; } elsif ( /^\s*\Sdb_xref="GI:(\d+)/ ) { $gi = $1; } elsif ( /^\s*CDS\s*(\S+)/ ) { $cds = $1; } if ( defined $cds && defined $gi && defined $version ) { print "$gi\t$version\t$cds\n"; $gi = $cds = undef; } } close INFILE;
      Ah, okay. I thought of it as follows. I initialized these variables in the beginning and set them to empty. In the end, I wanted to print if and only if all of these variables are assigned values (which were supposed to be read from the file). Thus, I thought it might work if I tell Perl to expect them as defined. But it did not work. :) Thank you for the code! (It did not occur to me that I should have set them as undef before going for any loop)
      5555555 123456789 join(11111..222222,333333..444444) 10101010 123456789 join(66666..7777777,888888..99999)
      It fails to put a tab between the very first and second elements. Do you know why this is the case?

        I don't know.    When I tried it, it worked.

Re: Read, match string and print
by sophix (Sexton) on Feb 08, 2010 at 08:18 UTC
    Uff, there is an elegant mistake. In the data;
    mRNA join(68351..68408,76646..77296) /gene="DEFB125" /product="defensin, beta 125" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /transcript_id="NM_153325.2" /db_xref="GI:76563935" /db_xref="GeneID:245938" /db_xref="HGNC:18105" CDS join(68351..68408,76646..77058) /gene="DEFB125" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /codon_start=1 /product="defensin, beta 125 preproprotein prepro +protein" /protein_id="NP_697020.2" /db_xref="GI:76563936" /db_xref="CCDS:CCDS12989.2" /db_xref="GeneID:245938" /db_xref="HGNC:18105" gene 123252..126392 /gene="DEFB126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /db_xref="GeneID:81623" /db_xref="HGNC:15900" mRNA join(123252..123327,126056..126392) /gene="DEFB126" /product="defensin, beta 126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /transcript_id="NM_030931.2" /db_xref="GI:30061484" /db_xref="GeneID:81623" /db_xref="HGNC:15900" CDS join(123270..123327,126056..126333) /gene="DEFB126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /codon_start=1 /product="defensin, beta 126 preproprotein prepro +protein" /protein_id="NP_112193.1" /db_xref="GI:13624333" /db_xref="CCDS:CCDS12990.1" /db_xref="GeneID:81623" /db_xref="HGNC:15900"
    In the code above, it does not care whether db_xref="GI comes from mRNA or CDS. But I need to get the GI value for CDS - not mRNA! I thought the following (a slight modification for the matching pattern of CDS=>GI) would fix this problem but it did not. Any suggestion is welcome.
    #!/usr/bin/perl use strict; use warnings; my $data = '/DATA/GenBankFile.gb'; # GenBank file is located at C:\DAT +A open INFILE, '<', $data or die "Please insert a new coin!\n"; my ( $cds, $gi, $version ); # Declaration of variables while ( <INFILE> ) { last if m!//$!; # Mark the end-of-file if ( /^VERSION.*\w:(\d+)/ ) { # Extract GI of the entire file $version = $1; } elsif ( /^\s*CDS\(s+\S+\n+\)*\Sdb_xref="GI:(\d+)/ ) { # Extract th +e protein gi $gi = $1; } elsif ( /^\s*CDS\s*(\S+)/ ) { # Extract the annotation $cds = $1; } if ( defined $cds && defined $gi && defined $version ) { # Print o +nly when all variables are defined print "$gi\t$version\t$cds\n"; $gi = $cds = undef; # Get ready for the next loop } } close INFILE;

      This looks like a standard format - ( examples and more ) - genbank? Is there a parser available from BioPerl or cpan? Sorry if someone has already asked this!!

      Just a something something...
        Hi, I think I stumbled on a GenBank oriented parser (especially, quite useful when you are reading a gene sequence) while reading Perl for Bioinformatics. I`ll check it out, and let you know. Nevertheless, I would like to learn how to use flags to mark the beginning and end of a block to be processed. That is, for the example in this thread, I want the code to make the adequate matchings only when encountered a CDS block (which starts with \s*CDS and ends with \s*gene) but not a mRNA block. Well, I still did not manage to, though. I understood how it works but could not figure out how to implement. :)