Be aware: Bio::SeqIO is ludicrously slow, and has been known to be so for a long time.
That's the trouble with O'Woe frameworks. Everything gets buried so deep in a dark, twisty mess of unnecessary subclasses and overzealous overrides that, even when the limitations are obvious and horribly detrimental, no one can see their way through to correcting the problem.
By way of contrast, run against a 200MB FASTA file holding 1,058,202 sequences of 140 characters each, the following takes just 11 seconds:
#! perl -slw
use strict;
use Data::Dumper;

# Read one FASTA record at a time by using '>' as the input record separator
local $/ = '>';

my @sequences;
(undef) = scalar <>;    # discard the empty "record" before the first '>'

my $start = time;
while( my $record = <> ) {
    my @lines = split "\n", $record;
    pop @lines if $lines[-1] eq '>';    # drop the trailing separator
    my $desc = shift @lines;            # first line is the description
    my $seq = join'', @lines;           # the rest is the sequence
    print $desc;
}

printf STDERR "Took %d seconds\n", time() - $start;

__END__
c:\test>fasta test.fasta >nul
Took 11 seconds

c:\test>dir test.fasta
27/07/2010  22:40       201,116,583 test.fasta

c:\test>tail test.fasta
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058200: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058201: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
>seq1058202: Some other descriptive text here
CAGGGCGGTTCATTCGCGGACCTATGGCATCCTGGCACTCAACCGGGACTGCGACCAACAATTTTGTCAA
CGCGCCTCAGCGGGGGAGGTCCGTATGACCCCGTCCATTGATTCGAACTGCCTAGTCCCCTGGATGACAA
As your sequences are much larger, I also ran it against the 163MB, 929-sequence (average 175k per sequence) na_clones.dros.RELEASE2.5 file I have kicking around. It took a whole 7 seconds:
c:\test>fasta \dell\test\fasta\na_clones.dros.RELEASE2.5
BACH50G05 : AC011761, 108350 bases, from X:19.
BACH57F14 : AC018478, 103809 bases, from 4:101.
BACH59K20 : AC010840, 29516 bases, from 4:101.
BACN19N21 : AC010839, 91789 bases, from 4:101.
...
BACR48O22 : AC104149, 193714 bases, from X:01.
BACR48O23 : AC009888, 168719 bases, from 3R:99.
BACR48O24 : AC023722, 191590 bases, from X:01.
BACR48P17 : AC012165, 176195 bases, from X:18.
BACR49A05 : AC008194, 181438 bases, from X:18.
Took 7 seconds
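If you want to reuse the record-separator trick rather than run a one-off script, something along these lines might do. It is only a sketch, not the code timed above: fasta_iterator is a hypothetical helper name, it sets $/ to "\n>" (instead of '>') so that a '>' inside a description line does not split a record, and it assumes line endings have already been normalised to "\n". I have not benchmarked it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): returns a closure
# that yields [ description, sequence ] for each FASTA record read from
# an already-open filehandle, and an empty list/undef at end of file.
sub fasta_iterator {
    my ($fh) = @_;
    return sub {
        local $/ = "\n>";                  # one FASTA record per read
        while ( defined( my $record = <$fh> ) ) {
            chomp $record;                 # strip the trailing "\n>"
            $record =~ s/^>//;             # first record keeps its leading '>'
            my ( $desc, @lines ) = split /\n/, $record;
            next unless defined $desc;     # skip empty records
            return [ $desc, join '', @lines ];
        }
        return;
    };
}

# Small demonstration on an in-memory file (made-up data)
my $data = ">seq1 first\nACGT\nTTGG\n>seq2 second\nCCCC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $next = fasta_iterator($fh);
while ( my $rec = $next->() ) {
    my ( $desc, $seq ) = @$rec;
    printf "%s: %d bases\n", $desc, length $seq;
}
# prints:
#   seq1 first: 8 bases
#   seq2 second: 4 bases
```

Because only one record is held in memory at a time, the same loop should cope with files of large sequences like the one above, provided each individual record fits in memory.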
In reply to Re: Bioinformatics: Slow Parsing of a Fasta File by BrowserUk, in thread Bioinformatics: Slow Parsing of a Fasta File by Anonymous Monk