Re: Bioinformatics: Slow Parsing of a Fasta File

Be aware: Bio::SeqIO is ludicrously slow, and has been known to be so for a long time.

That's the trouble with O'Woe frameworks. Everything gets buried so deep in a dark, twisty mess of unnecessary subclasses and overzealous overrides, that even when the limitations are obvious and horribly detrimental, no one can see their way through to correcting the problem.

By way of contrast, the following takes just 11 seconds to process a 200MB FASTA file containing 1,058,202 sequences of 140 characters each:

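#! perl -slw
use strict;

## Read '>'-delimited records rather than lines
local $/ = '>';

## Discard the first read, which is just the leading '>' of the file
(undef) = scalar <>;

my $start = time;
while( my $record = <> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';   ## Drop the trailing delimiter
    my $desc = shift @lines;           ## The header line, minus its '>'
    my $seq  = join '', @lines;        ## The sequence, line breaks removed

    print $desc;
}
printf STDERR "Took %d seconds\n", time() - $start;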

__END__
c:\test>fasta test.fasta >nul
Took 11 seconds

c:\test>dir test.fasta
27/07/2010  22:40       201,116,583 test.fasta

c:\test>tail test.fasta
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058200: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058201: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058202: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA

As your sequences are much larger, I ran it against the 163MB, 929 sequence (ave:175k/seq) na_clones.dros.RELEASE2.5 file I have kicking around. It took a whole 7 seconds:

c:\test>fasta  \dell\test\fasta\na_clones.dros.RELEASE2.5
BACH50G05 : AC011761, 108350 bases, from X:19.
BACH57F14 : AC018478, 103809 bases, from 4:101.
BACH59K20 : AC010840, 29516 bases, from 4:101.
BACN19N21 : AC010839, 91789 bases, from 4:101.
...
BACR48O22 : AC104149, 193714 bases, from X:01.
BACR48O23 : AC009888, 168719 bases, from 3R:99.
BACR48O24 : AC023722, 191590 bases, from X:01.
BACR48P17 : AC012165, 176195 bases, from X:18.
BACR49A05 : AC008194, 181438 bases, from X:18.
Took 7 seconds

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP an inspiration; A true Folk's Guy


Re^2: Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk on Jul 28, 2010 at 06:43 UTC
What does `(undef) = scalar <>;` do?

It is pretty clear that this program can be a kick-start, though I wanted to extract the `$seq->id` and `$seq->desc` and then work on them a little to create a filename for each of the files, each of which will contain one of these sequences.

Do you believe that the sequence length can have a performance-compromising effect on the way Bio::SeqIO does its job?

While not wanting to minimize the potential for the `Out of Memory!` error, I still think of using a hash whose keys are `$seq->id` and whose values are the sequence data itself, and then dumping each one of these into its corresponding folder.
Re^3: Bioinformatics: Slow Parsing of a Fasta File by BrowserUk (Patriarch) on Jul 28, 2010 at 08:01 UTC
"What does (undef) = scalar <>; do?"

With `$/ = '>';` set, the first read will get just the very first '>' in the file--i.e. the first character of the first line--which isn't useful, so the above just discards that.

"It is pretty clear that this program can be a kick-start though I wanted to extract the $seq->id and $seq->desc and then work on them a little bit to create a filename for files that each will contain one of these sequences"

It's all there, available for whatever you want to do. This, which has a couple of minor changes from the code I benchmarked above, might fulfill your requirements, though the filenames might be iffy, depending upon what's in the descriptions:

#! perl -slw
use strict;

local $/ = '>';

(undef) = scalar <>;

my $start = time;
while( my $record = <> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';
    my $desc = shift @lines;        ## This is the description
    my $seq  = join "\n", @lines;   ## This is the sequence.

    open my $out, '>', $desc . '.fasta'
        or warn "$desc.fasta : $!" and next;
    print $out ">$desc\n", $seq;
}
printf STDERR "Took %d seconds\n", time() - $start;

"Do you believe that the sequence length can have a performance compromising effect on the way the Bio::SeqIO does its job?"

Honestly, I could never work it out. The whole thing is so overcomplicated--from memory, it inherits from three (mostly unrelated) base classes, and then returns an object handle from a fourth class that might be any of a dozen other classes--that it is nigh impossible to trace statically. The only way to know what code is actually invoked would be to trace it through at runtime. No wonder no one dares try to fix it.

My best guess is that the problems stem from two sources:

1. Every method call traversing through half-a-dozen super-classes that do nothing but laboriously and redundantly check and re-check the same parameter values at each level on the way in, and do the same thing for the return values on the way out.

2. I don't know for sure, as I never managed to get it to install here so I could trace it through at runtime, but the symptoms of the problems that I read are consistent with it creating and retaining (possibly multiple) copies of every sequence in memory.

The code above only ever has one description and one sequence in memory at a time, so memory usage will never be a problem, unless you have a single sequence that is bigger than your virtual memory, in which case you'd be stuffed anyway.

"While not wanting to minimize the potential for the Out of Memory! error I still think of using a hash whose keys is $seq->id and whose values are the sequences data itself and then dumping each one of these into its corresponding folder."

Presumably the "not" above is a typo :)

If all you want is to split the file into lots of smaller files, there is no need to store everything in memory before writing it out again. And by doing so, you simply create a problem for the future when your next FASTA file is the full 3GB of the human genome (HG).

For those occasions when you might want to revisit earlier sequences, or correlate between sequences, or process the sequences in some order other than that in which they appear in the file, I have a simple tied hash implementation that retains just the offset/length pairs of the sequences read, so that it can quickly re-read individual sequences on demand without filling memory.
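
The offset/length idea can be sketched roughly as follows. This is not the tied hash implementation mentioned above (which is not shown in this thread), just a minimal illustration using two plain subs. The names `build_index` and `fetch_sequence`, the sample data, and the choice to key the index on the first word of the header are all assumptions made for the example.

```perl
#! perl
use strict;
use warnings;

## Scan the file once, remembering where each record's sequence data starts
## and how many bytes it spans. Only the ids and two integers per record are
## kept in memory. For a real file, open it with '<:raw' so that tell()
## offsets and length() agree on byte counts.
sub build_index {
    my( $fh ) = @_;
    my( %index, $id );
    local $/ = "\n";
    seek $fh, 0, 0 or die "seek failed: $!";
    while( my $line = <$fh> ) {
        if( $line =~ /^>(\S+)/ ) {
            $id = $1;
            $index{ $id } = [ tell( $fh ), 0 ];   ## [ offset, length ]
        }
        elsif( defined $id ) {
            $index{ $id }[ 1 ] += length $line;
        }
    }
    return \%index;
}

## Re-read one sequence on demand and strip its line breaks.
sub fetch_sequence {
    my( $fh, $index, $id ) = @_;
    my $entry = $index->{ $id } or return;
    my( $offset, $length ) = @$entry;
    seek $fh, $offset, 0 or die "seek failed: $!";
    my $buffer = '';
    my $got = read( $fh, $buffer, $length );
    die "read failed: $!" unless defined $got;
    $buffer =~ tr/\r\n//d;
    return $buffer;
}

## Made-up sample, read through an in-memory filehandle.
my $fasta = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my $index = build_index( $fh );

## Sequences can now be fetched in any order.
print "$_: ", scalar fetch_sequence( $fh, $index, $_ ), "\n" for qw( seq2 seq1 );
```

Because only the index lives in memory, the cost of revisiting a sequence is one seek and one read, whatever the size of the file.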
Re^3: Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk on Jul 28, 2010 at 07:04 UTC
Throws away the first line read from the magic null filehandle.
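
To see what that first read actually returns, here is a minimal sketch (not from the thread) that reads a made-up two-record sample through an in-memory filehandle with `$/` set to '>'. The sample data and the sub name `read_raw_records` are assumptions for illustration only.

```perl
#! perl
use strict;
use warnings;

## Hypothetical two-record sample.
my $fasta = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";

## Return every record read from $text with '>' as the input record separator.
sub read_raw_records {
    my( $text ) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '>';
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = read_raw_records( $fasta );

## Show each record with its newlines made visible.
for my $i ( 0 .. $#records ) {
    ( my $shown = $records[ $i ] ) =~ s/\n/\\n/g;
    print "read $i: '$shown'\n";
}
```

The first element is nothing but the leading '>', which is what `(undef) = scalar <>;` throws away. Every later record is a header plus its sequence lines, ending with the '>' that starts the next record (except for the last one).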
Re^2: Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk on Mar 22, 2011 at 16:35 UTC
">" characters are allowed in FASTA description header lines. If there is a ">" in the description, this code will mis-parse that entry.
Re^3: Bioinformatics: Slow Parsing of a Fasta File by BrowserUk (Patriarch) on Mar 23, 2011 at 06:21 UTC
I admit, I didn't think that that was legal. But anyway, the fix is quite trivial and has no effect upon performance:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my %sequences;

local $/ = '>';
(undef) = scalar <DATA>; ## Discard first delimiter
local $/ = "\n>";        ## Records now end at a newline followed by '>'
while( my $record = <DATA> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';
    my $id = shift @lines;
    $sequences{ $id } = join'', @lines;
}
pp \%sequences;
__DATA__
>uc002yje.1 > chr21:13973492-13974491
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.2 > chr21:13974492-13975432
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.3 > chr21:13975431-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG

Produces:

C:\test>fasta
{
  "uc002yje.1 > chr21:13973492-13974491" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
  "uc002yje.2 > chr21:13974492-13975432" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
  "uc002yje.3 > chr21:13975431-13976330" => "cccctgccccaccgcaccctggattactgcacgccaagaccctcacctgaacgcgccctacactctggcatgggggaacccggccccgcagagggccctggaCTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG",
}
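
The difference between the two separators can be checked on a small made-up sample. The sketch below is an illustration, not the code from this post: the sub name `parse_with_separator` and the sample data are assumptions, and it uses `chomp` to strip the trailing separator where the code above uses `pop`.

```perl
#! perl
use strict;
use warnings;

## Split FASTA text into [ header, sequence ] pairs, using $separator as the
## input record separator after discarding the file's leading '>'.
sub parse_with_separator {
    my( $text, $separator ) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    {
        local $/ = '>';
        my $discard = <$fh>;       ## the leading '>' of the first header
    }
    local $/ = $separator;
    my @records;
    while( my $record = <$fh> ) {
        chomp $record;             ## removes a trailing $separator, if present
        my( $header, @lines ) = split "\n", $record;
        push @records, [ $header, join '', @lines ];
    }
    return @records;
}

## Made-up sample whose headers contain a '>' of their own.
my $fasta = ">uc1 > chr21:1-8\nacgt\nACGT\n>uc2 > chr21:9-16\nttga\nTTGA\n";

for my $separator ( '>', "\n>" ) {
    my @records = parse_with_separator( $fasta, $separator );
    ( my $shown = $separator ) =~ s/\n/\\n/;
    printf "separator '%s' gives %d records\n", $shown, scalar @records;
}
```

With '>' alone, each header is cut in two at its embedded '>', so the two entries come back as four fragments. With "\n>", only a '>' at the start of a line ends a record, and the two entries come back intact.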
Re^2: Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk on Oct 06, 2011 at 03:04 UTC
You linked to Fun with local :-(
Re^3: Bioinformatics: Slow Parsing of a Fasta File by BrowserUk (Patriarch) on Oct 06, 2011 at 04:33 UTC
Link above now corrected. Instead of 604932 I apparently typed 64932.