in reply to Bioinformatics: Slow Parsing of a Fasta File
Be aware: Bio::SeqIO is ludicrously slow, and has been known to be so for a long time.
That's the trouble with O'Woe frameworks. Everything gets buried so deep in a dark, twisty mess of unnecessary subclasses and overzealous overrides that, even when the limitations are obvious and horribly detrimental, no one can see their way through to correcting the problem.
By way of contrast, run against a 200MB fasta file holding 1,058,202 sequences of 140 chars each, the following runs in just 11 seconds:
    #! perl -slw
    use strict;
    use Data::Dumper;

    # Read '>'-delimited records rather than newline-delimited lines
    local $/ = '>';

    my @sequences;
    (undef) = scalar <>;    # discard everything up to and including the first '>'

    my $start = time;
    while( my $record = <> ) {
        my @lines = split "\n", $record;
        pop @lines if $lines[-1] eq '>';    # drop the trailing delimiter
        my $desc = shift @lines;            # first line is the description
        my $seq = join'', @lines;           # remaining lines are the sequence
        print $desc;
    }

    printf STDERR "Took %d seconds\n", time() - $start;

    __END__
    c:\test>fasta test.fasta >nul
    Took 11 seconds

    c:\test>dir test.fasta
    27/07/2010  22:40       201,116,583 test.fasta

    c:\test>tail test.fasta
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058200: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058201: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
    >seq1058202: Some other descriptive text here
    CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
    CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
As your sequences are much larger, I ran it against the 163MB, 929 sequence (ave:175k/seq) na_clones.dros.RELEASE2.5 file I have kicking around. It took a whole 7 seconds:
    c:\test>fasta \dell\test\fasta\na_clones.dros.RELEASE2.5
    BACH50G05 : AC011761, 108350 bases, from X:19.
    BACH57F14 : AC018478, 103809 bases, from 4:101.
    BACH59K20 : AC010840, 29516 bases, from 4:101.
    BACN19N21 : AC010839, 91789 bases, from 4:101.
    ...
    BACR48O22 : AC104149, 193714 bases, from X:01.
    BACR48O23 : AC009888, 168719 bases, from 3R:99.
    BACR48O24 : AC023722, 191590 bases, from X:01.
    BACR48P17 : AC012165, 176195 bases, from X:18.
    BACR49A05 : AC008194, 181438 bases, from X:18.
    Took 7 seconds
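If you want the description and sequence handed back rather than printed, the same record-separator trick can be wrapped in a small iterator. What follows is only a sketch: `fasta_reader` is a made-up name, not part of any module, and the in-memory sample data is invented purely for illustration. It uses `chomp` (which strips whatever `$/` is currently set to) in place of the `pop` above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the post above: returns a closure that
# yields one [ description, sequence ] pair per call, and an empty
# list once the input is exhausted.
sub fasta_reader {
    my ($fh) = @_;
    return sub {
        local $/ = '>';                     # one '>'-delimited record per read
        while ( defined( my $record = <$fh> ) ) {
            chomp $record;                  # strips the trailing '>' if present
            next unless length $record;     # skip the empty chunk before the first '>'
            my ( $desc, @lines ) = split /\n/, $record;
            return [ $desc, join( '', @lines ) ];
        }
        return;
    };
}

# Invented sample data, read through an in-memory filehandle
my $data = ">seq1: first\nACGT\nTTGA\n>seq2: second\nGGCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $next = fasta_reader($fh);
my @records;
while ( my $rec = $next->() ) {
    push @records, $rec;
    print "$rec->[0] => $rec->[1]\n";
}
```

Like the script above, this assumes that '>' never appears inside a description line and that line endings are plain "\n"; a file that breaks either assumption would need extra handling.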