Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
What I am trying to do is open a FastA file containing DNA sequences, read the descriptor from the header of each sequence, use that descriptor as a new filename, and write each sequence out to its own file in preparation for further analyses (N.B. that part can be taken care of in due time).
Of course, each one of these records is so big that it spans many lines, and there are so many records that the entire file is over 200,000 KB. I wrote the following program in BioPerl, and just getting it to print the descriptors after '>' takes more than half an hour; even if I turn flushing on while writing to a file, the program again takes forever. So tell me, monks: what could be a potential cause of this, and what measures do you take when dealing with huge files? I am running on an HP with 2 GB of RAM and a dual-core processor.

An example of my data file 'Test.fasta', which has 3 records starting at '>':

>gi|62750809
TGAGCATGGGAGATCTTTCCATCTTCTGAGGTCTTCTTCAATTTCTTTCCTCAGTGTCTTGAAGTTCTTA
TTGTACAGATCTTTTACTTGCTTGGTTAATGTCACACCGAGGTATTTTATATTATTTGGGTCTATTATGA
>gi|151301097
TTTCTGGCTCCGCGGGCAGCGGGGCCGTGGCGCTCGGACGGTCTGGGATTCGGGCGCCGCCGCGGAACCG
GAATAAGAAGGGAGAGCGCCCGGCTCGGTCCTCGGTCTCCACCGCGGCCCGGAAGGAATCCGGGCAGCCT
>gi|25266387
TGTGTATGTATATTAATTACATTCATATGTATTCACAACACCTGCCTCAAATCAGGCAGAATGGTCCAGG
ATGGAATTAGGGGCAAGCATGAGGTCTTCAGGCTTACTGATTTCTAAGACACAGTAACTTCACTGGTTAG
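To make the goal concrete, here is the kind of split I am after, sketched in plain Perl (an untested sketch, not my working code: it assumes one header line per record and derives the output filename from the text after '>', swapping '|' for '_' since '|' is not filename-safe):

#!/usr/local/bin/perl
use strict;
use warnings;

my $file = "Test.fasta";
open my $in, '<', $file or die "Cannot open $file: $!";

my $out;    # filehandle for the current record's output file
while ( my $line = <$in> ) {
    if ( $line =~ /^>(\S+)/ ) {
        # Header line: start a new output file named after the descriptor.
        ( my $name = $1 ) =~ tr/|/_/;
        close $out if $out;
        open $out, '>', "$name.fasta" or die "Cannot open $name.fasta: $!";
    }
    print {$out} $line if $out;    # header and sequence lines alike
}
close $out if $out;

Reading line by line like this never holds more than one record's worth of text in memory, which I gather is the usual measure for files of this size.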
Here's my code (scaled down to represent the problem)
#!/usr/local/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $file = "Test.fasta";
my $in   = Bio::SeqIO->new(
    -file   => "<$file",
    -format => 'fasta',
    -flush  => 0,
);

while ( my $seq = $in->next_seq ) {
    # Note: for headers like '>gi|62750809' with nothing after the ID,
    # desc() is empty; the token after '>' is returned by $seq->id.
    my $desc = $seq->desc;
    print $desc, "\n";
}
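For the eventual write-out step, one direction (again just a sketch, assuming one output file per record is acceptable) would be a separate Bio::SeqIO writer per record, using the standard write_seq() call:

# Sketch: write each record to its own FASTA file, named after its ID.
while ( my $seq = $in->next_seq ) {
    ( my $name = $seq->id ) =~ tr/|/_/;    # '|' is not filename-safe
    my $out = Bio::SeqIO->new(
        -file   => ">$name.fasta",
        -format => 'fasta',
    );
    $out->write_seq($seq);
}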