perl_n00b has asked for the wisdom of the Perl Monks concerning the following question:
and the name of these files should be
>"accession #"_"biotype"
^^
"sequence"
I'm getting tons of errors that I think have to do with local/global variables, but I can't figure them out, so any help is greatly appreciated!!!

    use strict;
    use warnings;
    # Constants
    my $genfile = "c:\bemisia_coi.gb";
    my $outfile = "$accession_$biotype";
    my ($OUT, $IN);
    my $line;
    print "Input: $genfile\n";
    open($IN, "$genfile") or die "cannot open $genfile\n";
    while ($line = <$IN>) {
        $line = lc $line; #case insensitive
        my @line = split(/\///,$line);
        foreach my $i (0..$#line) {
            my $accession = "/locus\s*([a-z]{8})";
            my $biotype = "/biotype: ([a-z]{1})";
            my $sequence = "/origin(\*+)\//\";
            $sequence =~ s/\s//g; #Removing spaces
            $sequence =~ s/\d//g; #Removing numbers
            open($OUT, '>' "$outfile") or die "cannot open $outfile \n";
            print "Printing to $outfile \n";
            print($OUT ">$accession_$biotype/n^^/n$sequence");
        }
    }
Re: Spltting Genbank File
by toolic (Bishop) on May 29, 2009 at 00:30 UTC
The first error you are probably getting is from this line:

    my $outfile = "$accession_$biotype";

The dollar sign signifies a variable name, and the double quotes try to interpolate variables that you have not yet declared with my. If you really want an output file named $accession_$biotype, with literal dollar-sign characters in it, you could use single quotes to prevent interpolation, so that Perl treats it as a plain string rather than as variable names:

    my $outfile = '$accession_$biotype';
But, what I really think you are trying to do is to keep opening a new output file with a different name. In that case, somewhere inside your loops, you would:
or, equivalently:
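A minimal sketch of those two forms, assuming $accession and $biotype have already been extracted from the record. The sample values and the build_name_* helper names are made up for illustration. Note that in "$accession_$biotype" Perl reads $accession_ as the variable name, because an underscore is a legal identifier character, so the braces in the second form matter:

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Form 1: build the name by concatenation.
sub build_name_concat {
    my ($accession, $biotype) = @_;
    return $accession . '_' . $biotype;
}

# Form 2: interpolate, using braces so Perl does not
# look for a variable called $accession_.
sub build_name_interp {
    my ($accession, $biotype) = @_;
    return "${accession}_${biotype}";
}

# Made-up sample values.
my ($accession, $biotype) = ('ab123456', 'b');
print build_name_concat($accession, $biotype), "\n";
print build_name_interp($accession, $biotype), "\n";
```

Either form has to run inside the loop, after both values are known for the current record, and before the open call for the output file.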
Update: maybe you are looking for something close to this UNTESTED code:
Re: Spltting Genbank File
by citromatik (Curate) on May 29, 2009 at 07:38 UTC
A GenBank file is made up of a series of GenBank records, separated by a line consisting of "//".

You are reading your file line by line, with while ($line = <$IN>), and then splitting each line into different lines (??) with my @line = split(/\///,$line). Obviously, this is not what you want.

An easy way to solve this is to read the file record by record, by assigning the value "//" to $/ (see perlvar), and then process each record.

Another possibility is to read the file line by line and update the $accession, $biotype and $sequence variables accordingly. But you can run into trouble if a record doesn't have one of them.

Update: Following the first approach:
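A minimal sketch of the record-by-record idea, under several assumptions: records end with a line holding only "//", line endings are plain "\n", the first word after LOCUS is used as the accession, and everything after the ORIGIN line is sequence. The read_records helper and the two-record sample are made up for illustration, and the biotype is left out because its exact layout in the file is not shown here:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [accession, sequence] pairs.
sub read_records {
    my ($fh) = @_;
    local $/ = "//\n";          # one read now returns one whole record
    my @records;
    while (my $record = <$fh>) {
        chomp $record;          # removes the trailing "//\n" separator
        next unless $record =~ /\S/;
        # First word after LOCUS; assumed to be the accession here.
        my ($accession) = $record =~ /^LOCUS\s+(\S+)/m;
        # Everything after the ORIGIN line is sequence data.
        my ($sequence)  = $record =~ /^ORIGIN[^\n]*\n(.*)/ms;
        next unless defined $accession && defined $sequence;
        $sequence =~ s/[\s\d]//g;   # strip whitespace and position numbers
        push @records, [$accession, $sequence];
    }
    return @records;
}

# Made-up two-record sample, read through an in-memory filehandle.
my $sample = <<'END';
LOCUS       AB000001   12 bp
ORIGIN
        1 acgtacgtac gt
//
LOCUS       AB000002   8 bp
ORIGIN
        1 ttggccaa
//
END

open my $in, '<', \$sample or die "cannot open in-memory sample: $!";
for my $rec (read_records($in)) {
    print ">$rec->[0]\n$rec->[1]\n";
}
close $in;
```

Using local keeps the change to $/ confined to the helper, so other reads in the program still work line by line. Records that lack a LOCUS or ORIGIN line are skipped here rather than reported, which may or may not be what you want.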
Additional comments:
citromatik
Re: Spltting Genbank File
by starX (Chaplain) on May 29, 2009 at 00:33 UTC

    my $genfile = 'c:\bemisia_coi.gb';

and

    my @line = split(/\/\//,$line); # If you want to split on //

and

    my $sequence = "/origin(\*+)\/\/\"; # as above

Good on you for using strict and warnings, though. The specific errors that you're getting will do wonders for helping us understand what specific problems you have.
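For the split, a small sketch of the same pattern written with a different regex delimiter, which avoids the escaped slashes altogether. The sample string is made up:

```perl
use strict;
use warnings;

# Made-up sample text with two "//" separators.
my $text = "record one//record two//record three";

# Same pattern twice: with m{...} the slashes need no escaping.
my @escaped = split(/\/\//, $text);
my @braces  = split(m{//}, $text);

print scalar(@braces), " pieces\n";
print "$_\n" for @braces;
```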
by rovf (Priest) on May 29, 2009 at 10:44 UTC

    my $genfile = 'c:\bemisia_coi.gb';

I know that your specification is technically correct, but since the OP is new to Perl and programming, I think it would be better to escape the backslash, though it's not strictly necessary here:

    my $genfile = 'c:\\bemisia_coi.gb';

Getting into the habit of escaping backslashes inside single quotes every time prevents surprises when we one day have to write strings that are supposed to contain two backslashes in a row (such as UNC paths on Windows):
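A minimal sketch of that surprise. The server and share names are made up:

```perl
use strict;
use warnings;

# In single quotes, \\ yields one backslash, and a backslash
# before any other character (except ') is kept as-is.
my $plain   = 'c:\bemisia_coi.gb';     # c:\bemisia_coi.gb
my $escaped = 'c:\\bemisia_coi.gb';    # same string

# UNC path: the leading \\ collapses to ONE backslash unless doubled.
my $unc_wrong = '\\server\share';       # \server\share
my $unc_right = '\\\\server\\share';    # \\server\share

print "same\n" if $plain eq $escaped;
print "$unc_wrong\n$unc_right\n";
```

With the habit of always doubling, the UNC case is written correctly without having to think about it.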
-- Ronald Fischer <ynnor@mm.st>
Re: Spltting Genbank File
by korpenkraxar (Sexton) on May 29, 2009 at 11:22 UTC

Good to see you taking up Perl for bioinformatics! It is really unparalleled for this kind of stuff. Your task is a very typical bioinformatics problem, and solving it is a great learning experience. I am a biologist myself and something of a self-taught Perl buff who learned the language while writing bioinformatics tools.

My comment on your post is not going to be about the code, since others have already provided great feedback, but a hint. When it comes to parsing the contents of GenBank files in particular, I can tell you from personal experience that there are all these little unpredictable variations from the "norm" (rather than "standard") popping up here and there, and they are a pain to catch since they might occur in one accession out of 20,000. The rich EMBL format is usually much simpler to parse if you have a choice between the two.

Do *not* waste your time trying to write a *comprehensive* GenBank parser from scratch unless you really have to. Instead, I really, really recommend looking into the BioPerl project, where they have already implemented a pretty reliable GenBank parser. The main benefit of the BioPerl project, its comprehensiveness, is however also its main drawback. It can be bewildering to try to get an overview of all the libraries and methods, and you will need to be comfortable with, or open to learning, a little object-oriented Perl programming to grasp what is going on. I can tell you, however, that the effort you put into learning some BioPerl will be well worth it in the end.

You can find BioPerl here: BioPerl and some simple examples here: Bioperl Tutorial
by perl_n00b (Acolyte) on May 29, 2009 at 18:54 UTC
Here is a sample entry in the genbank file. The biotype entry is after "/note". I thought I would have to do all lower case because a couple of the entries weren't uniform and had uppercase letters; is this not needed?
Here is my updated code that now contains more errors lol.
And here are the errors I get
by citromatik (Curate) on Jun 01, 2009 at 07:28 UTC
You have several errors and misconceptions here:
citromatik
Re: Spltting Genbank File
by stajich (Chaplain) on Jul 12, 2009 at 04:41 UTC