perl_n00b has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,
I am just learning perl (my 1st computer language) and am having some trouble with a simple script.
I am trying to split a large genbank file into individual sequence files. The individual sequence files need to look like this...
>"accession #"_"biotype"<br> ^^<br> "sequence"<br> <br>
and the name of these files should be
"accession#_biotype.seq"

Here is my script...
use strict; use warnings; # Constants my $genfile = "c:\bemisia_coi.gb"; my $outfile = "$accession_$biotype"; my ($OUT, $IN); my $line; print "Input: $genfile\n"; open($IN, "$genfile") or die "cannot open $genfile\n"; while ($line = <$IN>) { $line = lc $line; #case insensitive my @line = split(/\///,$line); foreach my $i (0..$#line) { my $accession = "/locus\s*([a-z]{8})"; my $biotype = "/biotype: ([a-z]{1})"; my $sequence = "/origin(\*+)\//\"; $sequence =~ s/\s//g; #Removing spaces $sequence =~ s/\d//g; #Removing numbers open($OUT, '>' "$outfile") or die "cannot open $outfile \n"; print "Printing to $outfile \n"; print($OUT ">$accession_$biotype/n^^/n$sequence"); } }
Im getting tons of error that I think have to deal with local/global variables but I can't figure out, so any help is greatly appreciated!!!

Replies are listed 'Best First'.
Re: Spltting Genbank File
by toolic (Bishop) on May 29, 2009 at 00:30 UTC
    Welcome to the Monastery and welcome to Perl!

    The first error you are probably getting is from this line:

    my $outfile = "$accession_$biotype";

    The dollar sign signifies a variable name, and the double quotes are trying to interpolate a variable which you have not declared yet with my. If you really want an output file named $accession_$biotype with literal dollar sign characters in it, you could try using single quotes to prevent interpolation, and hence, to make Perl understand that it is just a string, not a variable name:

    my $outfile = '$accession_$biotype';

    But, what I really think you are trying to do is to keep opening a new output file with a different name. In that case, somewhere inside your loops, you would:

    $outfile = "${accession}_${biotype}";

    or, equivalently:

    $outfile = $accession . '_' . $biotype;

    Update: maybe you are looking for something close to this UNTESTED code:

    use strict; use warnings; my $genfile = "c:\bemisia_coi.gb"; my $outfile; my ($OUT, $IN); print "Input: $genfile\n"; open $IN, '<', $genfile or die "cannot open $genfile: $!\n"; while (<$IN>) { $_ = lc $_; #case insensitive my @tokens = split m{//}; for my $token (@tokens) { my ($accession) = ($token =~ /locus\s*([a-z]{8})/); my ($biotype) = ($token =~ /biotype: ([a-z]{1})/); my ($sequence) = ($token =~ /origin(\*+)\/\//); $sequence =~ s/\s//g; #Removing spaces $sequence =~ s/\d//g; #Removing numbers $outfile = "${accession}_${biotype}"; open $OUT, '>' $outfile or die "cannot open $outfile: $!\n"; print "Printing to $outfile \n"; print $OUT, ">$outfile/n^^/n$sequence"); } }
Re: Spltting Genbank File
by citromatik (Curate) on May 29, 2009 at 07:38 UTC

    A Genbank file is made of a series of Genbank records separated by a line consisting of "//"

    You are reading your file line by line:  while ($line = <$IN>) and splitting each line in different lines (??) my @line = split(/\///,$line). Obviously, this is not what you want.

    An easy way to solve this is to read the file record by record by assigning the value "//" to $/ (see perlvar) and then process the record

    Another possibility is to read the file line by line, and update the $accession, $biotype and $sequence variables, accordingly. But you can be in a problem if a record doesn't have one of them

    Update: Following the first approach:

    use strict; use warnings; $/ = "//"; my $genfile = "c:\bemisia_coi.gb"; open my $ifh, "<", $genfile or die "cannot open $genfile: $!\n"; while (my $chunk = <$ifh>){ last if eof $ifh; my ($accession) = $chunk =~ /LOCUS\s*([A-Z]*\d+)/; my ($biotype) = $chunk =~ /BIOTYPE\s*([A-Z])/; my ($sequence) = $chunk =~ /ORIGIN\s*(.*)$/s; $sequence =~ s/\s|\d//g; my $outfile = "${accession}_${biotype}"; open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n"; print $ofh, ">$accession\n$sequence"; close $ofh; }

    Additional comments:

    • You don't need to lc your input
    • I've never seen a "biotype" field in a Genbank file (I can't find it in the specs either), but I may be wrong, so I conserved the code that tries to capture that field, instead you can see the molecule type in the LOCUS field (DNA|PROT)
    • A typical "locus name" is made of letters (2 or 3) and numbers, you only test for 8 letters
    • The last record of the file is always empty (because the file ends with a "//") line, that is why I use last if eof $ifh;
    • Don't forget to close each output file when it is not needed, a Genbank file could be very large and you can run out of open filehandles, they are a finite resource on every operating system
    • The "ORIGIN" sub-record is multi-line, so you should add the /s modifier to your regexp (see perlre)
    • If you want to scape the "//" in a regular expression, you should escape both slashes: /\/\//
    • $sequence =~ s/\s//g; $sequence =~ s/\d//g; can be written $sequence =~ s/\s|\d//g;

    citromatik

Re: Spltting Genbank File
by starX (Chaplain) on May 29, 2009 at 00:33 UTC
    I and others would be very interested in seeing some of these errors, I'm sure, but right off the top of my head, you might try the following changes:

    my $genfile = 'c:\bemisia_coi.gb';

    and

     my @line = split(/\/\//,$line); # If you want to split on //

    and

    my $sequence = "/origin(\*+)\/\/\"; # as above

    Good on you for using strict and warnings, though. The specific errors that you're getting will do wonders for helping us understand what specific problems you have.

      my $genfile = 'c:\bemisia_coi.gb';

      I know that your specification is technically correct, but since the OP is new to Perl and programming, I think it would be better to escape the backslash, though it's not strictly necessary here:

      my $genfile = 'c:\\bemisia_coi.gb';
      Getting the habit of escaping backslashes inside single quotes every time, prevents from surprises when we one day have to write strings which are supposed to contain two backslashes in a row (such as UNC pathes on Windows):
      # Access UNC path \\myserver\c$\Data $genfile = '\\myserver\c$\Data\x.gb'; # wrong $genfile = '\\\myserver\c$\Data\x.gb'; # correct, but doesn't look nic +e $genfile = '\\\\myserver\\c$\\Data\\x.gb'; # correct, and looks consis +tent

      -- 
      Ronald Fischer <ynnor@mm.st>
Re: Spltting Genbank File
by korpenkraxar (Sexton) on May 29, 2009 at 11:22 UTC
    Hi perl_n00b!

    Good to see you taking up Perl for bioinformatics! It is really unparalleled for this kind of stuff.

    Your task is a very typical bioinformatical problem and solving it is a great learning experience. I am a biologist myself and sort of a self-thought perl-buff who learned the language while writing bioinformatics tools. My comment on your post is however not going to be about the code since others have already provided great feedback, but a hint.

    When it comes to parsing contents in Genbank files in particular, I can tell you from personal experience that there are all these little unpredictable variations from the "norm" (rather than "standard") popping up every here and there that are a pain to catch since they might occur in one accession out of 20.000. The rich EMBL format is usually much simpler to parse if you have a choice among these two.

    Do *not* waste your time on trying to write a *comprehensive* Genbank parser from scratch unless you really have to. Instead I really really recommend looking into the Bioperl project where they have already implemented a pretty reliable Genbank parser. The main benefit of the Bioperl project, it's comprehensiveness, is however also its main drawback. It can be bewildering to try to get an overview of all the libraries and methods and you will need to be comfortable with or open to learn a little object oriented perl programming to grasp what is going on. I can tell you however that the effort you put into learning a some Bioperl will be well worth it in the end.

    You can find Bioperl here:
    BioPerl

    and some simple examples here:
    Bioperl Tutorial
      First off, thanks for all the replies guys! I shouldn't have rushed my posting since I left out some much needed information so sorry about that.
      Here is a sample entry in the genbank file. The biotype entry is after "/note". I thought I would have to do all lower case because a couple of the entries weren't uniform and had uppercase letters; is this not needed?
      LOCUS EU099432 832 bp DNA linear INV 27 +-APR-2009 DEFINITION Bemisia tabaci strain 05-06 cytochrome oxidase subunit I-l +ike (COI) gene, partial sequence; mitochondrial. ACCESSION EU099432 VERSION EU099432.1 GI:158726330 KEYWORDS . SOURCE mitochondrion Bemisia tabaci ORGANISM Bemisia tabaci Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygo +ta; Neoptera; Paraneoptera; Hemiptera; Sternorrhyncha; Aleyrod +iformes; Aleyrodoidea; Aleyrodidae; Aleyrodinae; Bemisia. REFERENCE 1 (bases 1 to 832) AUTHORS Ma,W.H., Li,X.C., Dennehy,T.J., Lei,C.L., Wang,M., Degain, +B.A. and Nichols,R.L. TITLE Utility of MtCOI polymerase chain reaction-restriction fra +gment length polymorphism in differentiating between Q and B whi +tefly Bemisia tabaci biotypes JOURNAL Insect Sci. 16 (2), 107-114 (2009) REFERENCE 2 (bases 1 to 832) AUTHORS Ma,W., Li,X., Degain,B. and Dennehy,T. TITLE Direct Submission JOURNAL Submitted (13-AUG-2007) Entomology, The University of Ariz +ona, 1140 E. 4th St., Tucson, AZ 85721, USA FEATURES Location/Qualifiers source 1..832 /organism="Bemisia tabaci" /organelle="mitochondrion" /mol_type="genomic DNA" /strain="05-06" /db_xref="taxon:7038" /country="USA: Arizona" /note="biotype: B" gene <1..>832 /gene="COI" misc_feature <1..>832 /gene="COI" /note="similar to cytochrome oxidase subunit I" ORIGIN 1 atatgcatgg agtgattttt tggtccccca gaagtaatat ggcagattag tgcat +tggac 61 ttgatttgtt tggtcatcca taaggcaaaa ggcacaatag ggcttcgaag gttta +ttgtt 121 tgacgtcctc atatattcac agttggaata gatgtagata ctcgagctta tttca +cttca 181 gccactataa ttattgctgt tcccacagga attaaaattt ttagttggct tgcta +ctttg 241 ggtggaataa agtctaataa attaaggcct cttggccttt gatttacagg atttt +tattt 301 ttatttacta taggtgggtt aactggaatt attcttggta attcttctgt agatg +tgtgt 361 ctgcatgaca cttattttgt tgttgcacat tttcattatg ttttatcaat aggaa +ttatt 421 tttgctattg taggaggagt tatctattga tttccactaa tcttaggttt aacct +taaat 481 aattatagat tggtgtctca attttatatc atgtttattg gagtaaattt aactt +ttttt 541 cctcaccatt ttcttggttt agggggaatg cctcgtcgat attcagatta tgctg +attgc 601 tatctagtat gaaataaaat ttcttctgcg ggaaggattc tgagtattat ttctg +ttatt 661 tattttttat ttattgtttt agaatccttt cttcttctgc ggttagtaag attta +agctt 721 ggtgtaagta ggcatctaga atgaaagatt aataaaccag ctcttaatca cagtt +ttaaa 781 gagttgtgtt taactttttt tttctaatat ggcagattag ggccccggga aa //

      Here is my updated code that now contains more errors lol.
      # Genbank Splitter # Takes Accession number and biotype as name for new FASTA file # Contents of new FASTA file is corresponding sequence use strict; use warnings; $/ = "//"; # Constants my $genfile = 'c:\bemisia_coi.gb'; my $outfile = "$accession_$biotype"; my ($OUT, $IN); print "Input: $genfile\n"; open my $ifh, "<", $genfile or die "cannot open $genfile: $!\n"; while (my $chunk = <$ifh>){ last if eof $ifh; $chunk = lc $chunk; my ($accession) = $chunk =~ /locus\s*([a-z]{8}); my ($biotype) = $chunk =~ /biotype: ([a-z]{1}); my ($sequence) = $chunk =~ "/origin(\*+)\/\/\"; $sequence =~ s/\s|\d//g; my $outfile = "${accession}_${biotype}"; open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n"; print "Printing to $outfile\n"; print $ofh, ">$accession $biotype\n^^\n$sequence"; close $ofh; }

      And here are the errors I get
      C:\Users\Owner>perl c:\gen_split2.pl Bareword found where operator expected at c:\gen_split2.pl line 22, ne +ar "my ($b iotype) = $chunk =~ /biotype" (Might be a runaway multi-line // string starting on line 21) (Do you need to predeclare my?) Backslash found where operator expected at c:\gen_split2.pl line 22, n +ear "bioty pe\" Unrecognized escape \s passed through at c:\gen_split2.pl line 23. Unrecognized escape \d passed through at c:\gen_split2.pl line 23. Scalar found where operator expected at c:\gen_split2.pl line 25, near + "my $outf ile = "${accession}" (Might be a runaway multi-line "" string starting on line 23) (Do you need to predeclare my?) Bareword found where operator expected at c:\gen_split2.pl line 25, ne +ar "${acce ssion}_" (Missing operator before _?) String found where operator expected at c:\gen_split2.pl line 27, near + "open my $ofh, '>' $outfile or die "" (Might be a runaway multi-line "" string starting on line 25) (Missing semicolon on previous line?) Backslash found where operator expected at c:\gen_split2.pl line 27, n +ear "$!\" (Missing operator before \?) String found where operator expected at c:\gen_split2.pl line 29, near + "print "" (Might be a runaway multi-line "" string starting on line 27) (Missing semicolon on previous line?) Bareword found where operator expected at c:\gen_split2.pl line 29, ne +ar "print "Printing" (Do you need to predeclare print?) Backslash found where operator expected at c:\gen_split2.pl line 29, n +ear "$outf ile\" (Missing operator before \?) String found where operator expected at c:\gen_split2.pl line 31, near + "print $o fh, "" (Might be a runaway multi-line "" string starting on line 29) (Missing semicolon on previous line?) Scalar found where operator expected at c:\gen_split2.pl line 31, near + "$accessi on $biotype" Global symbol "$accession_" requires explicit package name at c:\gen_s +plit2.pl l ine 12. Global symbol "$biotype" requires explicit package name at c:\gen_spli +t2.pl line 12. Global symbol "$biotype" requires explicit package name at c:\gen_spli +t2.pl line 21. syntax error at c:\gen_split2.pl line 22, near "my ($biotype) = $chunk + =~ /bioty pe" Global symbol "$chunk" requires explicit package name at c:\gen_split2 +.pl line 2 3. Global symbol "$sequence" requires explicit package name at c:\gen_spl +it2.pl lin e 23. syntax error at c:\gen_split2.pl line 25, near "my $outfile = "${acces +sion}" Global symbol "$accession" requires explicit package name at c:\gen_sp +lit2.pl li ne 25. Global symbol "$biotype" requires explicit package name at c:\gen_spli +t2.pl line 25. Global symbol "$accession" requires explicit package name at c:\gen_sp +lit2.pl li ne 31. c:\gen_split2.pl has too many errors.

        You have several errors and mis-conceptions here:

        • my ($accession) = $chunk =~ /locus\s*([a-z]{8}); You haven't closed the regexp (the final slash is missing): /locus\s*([a-z]{8})/;
        • /biotype: ([a-z]{1});. Same as before
        • open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n";. When you use the 3-argument open, you should provide 3 arguments separated with commas (the comma after '>' is missing)
        • my $outfile = "$accession_$biotype";. You are trying to use the variables $accession and $biotype before declaring them. Bare in mind that both variables are updated after reading each record, so you should update $outfile (and open the file for writing) after reading each record (inside the while loop) as I told you in the code I posted. Also, bear in mind what toolic told you about using ${accession}_$biotype

        citromatik

Re: Spltting Genbank File
by stajich (Chaplain) on Jul 12, 2009 at 04:41 UTC
    Wow. You are really reinveting the wheel here. The EMBOSS tools can do this as well. Here's a bioperl solution of one new file per sequence in the original file, named by accession number. No one should really waste their time writing a genbank parser unless it is homework... Note that 'genbank' can be substituted below with any of the supported formats (fasta, embl,swissprot,etc).
    use Bio::SeqIO. my $in = Bio::SeqIO->new(-format => 'genbank', -file => $genfile); while(my $seq = $in->next_seq ){ my $acc =$seq->accession_number; my $out = Bio::SeqIO->new(-format => 'genbank',-file=>">$acc.gb"); $out->write_seq($seq); }