Read, match string and print

sophix has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks! I am still having difficulties with the basics: reading, matching and printing. I would appreciate if you could help me figure out the mistakes I have been making in the following code. Thanks!

#!/usr/bin/perl
use strict;
use warnings;

my $cds = '';
my $version = '';
my $gi = '';
my @genbank = ();

my $data_file ="/DATA/GenBankFile.gb";
open(INFILE, $data_file) || die("Could not open file!");
@genbank=<INFILE>;
close(INFILE);

foreach my $line (@genbank) {
if( $line =~ /^\/\/\n/ ) { last;}
elsif($line =~ /^VERSION/) { $line =~ s/^VERSION\s*\w*\S\d*\s*\w*\S//;
+ $version = $line;}
elsif($line =~ /^\s*\Sdb_xref="GI/) { chomp($line); $line =~ s/^\s*\Sd
+b_xref="GI://; $gi = $line;}
elsif($line =~ /^\s*CDS/){ chomp($line); $line =~ s/^\s*CDS\s*//; $cds
+ = $line;}

print "$gi"."\t"."$version"."\t"."$cds\n" if (defined $cds and defined
+ $gi and defined $version);
}
[download]

and here is the data I have been reading:

LOCUS       NC_0000230           600020 bp    DNA     linear   CON 21-
+APR-2007
VERSION     NC_000023.10  GI:123456789
     CDS             join(11111..222222,333333..444444)
                     /db_xref="GI:5555555"
     CDS             join(66666..7777777,888888..99999)
                     /db_xref="GI:10101010"
//
[download]

and here is the far-from-the-desired-output:

        
    123456789
    
    123456789
    join(11111..222222,333333..444444)
5555555"    123456789
    join(11111..222222,333333..444444)
5555555"    123456789
    join(66666..7777777,888888..99999)
10101010"    123456789
    join(66666..7777777,888888..99999)
10101010"    123456789
    join(66666..7777777,888888..99999)
[download]

and here is what the output should look like:

5555555 123456789 join(11111..222222,333333..444444)
10101010 123456789 join(66666..7777777,888888..99999)
[download]

Comment on Read, match string and print Select or Download Code

Replies are listed 'Best First'.
Re: Read, match string and print by jwkrahn (Abbot) on Feb 08, 2010 at 07:15 UTC
Your first mistake is assigning values to `$cds`, `$version` and `$gi` and then using the test `defined $cds and defined $gi and defined $version` on values that are always defined. And then your next mistake is not clearing out the values in the variables after you have printed them. What you need is something more like this: `#!/usr/bin/perl use strict; use warnings; my $data_file = '/DATA/GenBankFile.gb'; open INFILE, '<', $data_file or die "Could not open '$data_file' $!"; my ( $cds, $gi, $version ); while ( <INFILE> ) { last if m!//$!; if ( /^VERSION.\w:(\d+)/ ) { $version = $1; } elsif ( /^\s\Sdb_xref="GI:(\d+)/ ) { $gi = $1; } elsif ( /^\sCDS\s(\S+)/ ) { $cds = $1; } if ( defined $cds && defined $gi && defined $version ) { print "$gi\t$version\t$cds\n"; $gi = $cds = undef; } } close INFILE;` [download]	[reply] [d/l] [select]
Re^2: Read, match string and print by sophix (Sexton) on Feb 08, 2010 at 07:37 UTC
Ah, okay. I thought of it as follows. I initialized these variables in the beginning and set them to empty. In the end, I wanted to print if and only if all of these variables are assigned values (which were supposed to be read from the file). Thus, I thought it might work if I tell Perl to expect them as defined. But it did not work. :) Thank you for the code! (It did not occur to me that I should have set them as undef before going for any loop) `5555555 123456789 join(11111..222222,333333..444444) 10101010 123456789 join(66666..7777777,888888..99999)` [download] It fails to put a tab between the very first and second elements. Do you know why this is the case?	[reply] [d/l]
Re^3: Read, match string and print by jwkrahn (Abbot) on Feb 08, 2010 at 09:00 UTC
I don't know. When I tried it, it worked.	[reply]
Re^4: Read, match string and print by sophix (Sexton) on Feb 08, 2010 at 09:26 UTC
Re^5: Read, match string and print by Corion (Patriarch) on Feb 08, 2010 at 09:32 UTC
Some notes below your chosen depth have not been shown here
Re: Read, match string and print by sophix (Sexton) on Feb 08, 2010 at 08:18 UTC
Uff, there is an elegant mistake. In the data; mRNA join(68351..68408,76646..77296) /gene="DEFB125" /product="defensin, beta 125" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /transcript_id="NM_153325.2" /db_xref="GI:76563935" /db_xref="GeneID:245938" /db_xref="HGNC:18105" CDS join(68351..68408,76646..77058) /gene="DEFB125" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /codon_start=1 /product="defensin, beta 125 preproprotein prepro +protein" /protein_id="NP_697020.2" /db_xref="GI:76563936" /db_xref="CCDS:CCDS12989.2" /db_xref="GeneID:245938" /db_xref="HGNC:18105" gene 123252..126392 /gene="DEFB126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /db_xref="GeneID:81623" /db_xref="HGNC:15900" mRNA join(123252..123327,126056..126392) /gene="DEFB126" /product="defensin, beta 126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /transcript_id="NM_030931.2" /db_xref="GI:30061484" /db_xref="GeneID:81623" /db_xref="HGNC:15900" CDS join(123270..123327,126056..126333) /gene="DEFB126" /note="Derived by automated computational analysi +s using gene prediction method: BestRefseq." /codon_start=1 /product="defensin, beta 126 preproprotein prepro +protein" /protein_id="NP_112193.1" /db_xref="GI:13624333" /db_xref="CCDS:CCDS12990.1" /db_xref="GeneID:81623" /db_xref="HGNC:15900" [download] In the code above, it does not care whether db_xref="GI comes from mRNA or CDS. But I need to get the GI value for CDS - not mRNA! I thought the following (a slight modification for the matching pattern of CDS=>GI) would fix this problem but it did not. Any suggestion is welcome. #!/usr/bin/perl use strict; use warnings; my $data = '/DATA/GenBankFile.gb'; # GenBank file is located at C:\DAT +A open INFILE, '<', $data or die "Please insert a new coin!\n"; my ( $cds, $gi, $version ); # Declaration of variables while ( <INFILE> ) { last if m!//$!; # Mark the end-of-file if ( /^VERSION.\w:(\d+)/ ) { # Extract GI of the entire file $version = $1; } elsif ( /^\sCDS$s+\S+\n+$\Sdb_xref="GI:(\d+)/ ) { # Extract th +e protein gi $gi = $1; } elsif ( /^\sCDS\s*(\S+)/ ) { # Extract the annotation $cds = $1; } if ( defined $cds && defined $gi && defined $version ) { # Print o +nly when all variables are defined print "$gi\t$version\t$cds\n"; $gi = $cds = undef; # Get ready for the next loop } } close INFILE; [download]	[reply] [d/l] [select]
Re^2: Read, match string and print by BioLion (Curate) on Feb 08, 2010 at 14:58 UTC
This looks like a standard format - ( examples and more ) - genbank? Is there a parser available from BioPerl or cpan? Sorry if someone has already asked this!! Just a something something...	[reply]
Re^3: Read, match string and print by sophix (Sexton) on Feb 10, 2010 at 14:21 UTC
Hi, I think I stumbled on a GenBank oriented parser (especially, quite useful when you are reading a gene sequence) while reading Perl for Bioinformatics. I`ll check it out, and let you know. Nevertheless, I would like to learn how to use flags to mark the beginning and end of a block to be processed. That is, for the example in this thread, I want the code to make the adequate matchings only when encountered a CDS block (which starts with \sCDS and ends with \sgene) but not a mRNA block. Well, I still did not manage to, though. I understood how it works but could not figure out how to implement. :)	[reply]