Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
What I am trying to do is open a FastA file containing DNA sequences, read the descriptor from the header of each sequence, use that descriptor as a new filename, and write each sequence out to its own file in preparation for further analyses (N.B. that part can be taken care of in due time).
Of course, each one of these records is so big that it spans many lines, and there are so many records that the entire file is over 200,000 KB. I wrote the following program in BioPerl, and just getting it to print the descriptors after '>' takes more than half an hour; even if I turn flushing on while writing to a file, the program again takes forever. So tell me, monks: what could be a potential cause of this, and what measures do you take when dealing with huge files? I am running on an HP with 2 GB of RAM and a dual-core processor.

An example of my data file 'Test.fasta', which has 3 records starting at '>':

>gi|62750809
TGAGCATGGGAGATCTTTCCATCTTCTGAGGTCTTCTTCAATTTCTTTCCTCAGTGTCTTGAAGTTCTTA
TTGTACAGATCTTTTACTTGCTTGGTTAATGTCACACCGAGGTATTTTATATTATTTGGGTCTATTATGA
>gi|151301097
TTTCTGGCTCCGCGGGCAGCGGGGCCGTGGCGCTCGGACGGTCTGGGATTCGGGCGCCGCCGCGGAACCG
GAATAAGAAGGGAGAGCGCCCGGCTCGGTCCTCGGTCTCCACCGCGGCCCGGAAGGAATCCGGGCAGCCT
>gi|25266387
TGTGTATGTATATTAATTACATTCATATGTATTCACAACACCTGCCTCAAATCAGGCAGAATGGTCCAGG
ATGGAATTAGGGGCAAGCATGAGGTCTTCAGGCTTACTGATTTCTAAGACACAGTAACTTCACTGGTTAG
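To make the goal concrete, here is the kind of split I am after, sketched in plain Perl (an untested sketch, not my working code: it assumes one header line per record and derives the output filename from the text after '>', swapping '|' for '_' since '|' is not filename-safe):

#!/usr/local/bin/perl
use strict;
use warnings;

my $file = "Test.fasta";
open my $in, '<', $file or die "Cannot open $file: $!";

my $out;    # filehandle for the current record's output file
while ( my $line = <$in> ) {
    if ( $line =~ /^>(\S+)/ ) {
        # Header line: start a new output file named after the descriptor.
        ( my $name = $1 ) =~ tr/|/_/;
        close $out if $out;
        open $out, '>', "$name.fasta" or die "Cannot open $name.fasta: $!";
    }
    print {$out} $line if $out;    # header and sequence lines alike
}
close $out if $out;

Reading line by line like this never holds more than one record's worth of text in memory, which I gather is the usual measure for files of this size.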
Here's my code (scaled down to represent the problem)
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $file = "Test.fasta";
my $in   = Bio::SeqIO->new(
    -file   => "<$file",
    -format => 'fasta',
    -flush  => 0,
);

while ( my $seq = $in->next_seq ) {
    # Note: for headers like '>gi|62750809' with nothing after the ID,
    # desc() is empty; the token after '>' is returned by $seq->id.
    my $desc = $seq->desc;
    print $desc, "\n";
}
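For the eventual write-out step, one direction (again just a sketch, assuming one output file per record is acceptable) would be a separate Bio::SeqIO writer per record, using the standard write_seq() call:

# Sketch: write each record to its own FASTA file, named after its ID.
while ( my $seq = $in->next_seq ) {
    ( my $name = $seq->id ) =~ tr/|/_/;    # '|' is not filename-safe
    my $out = Bio::SeqIO->new(
        -file   => ">$name.fasta",
        -format => 'fasta',
    );
    $out->write_seq($seq);
}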