bobychan has asked for the wisdom of the Perl Monks concerning the following question:
What I need to do is read the file in and stuff each sequence into a hash where the key is the sequence name and the value is the sequence itself. My problem is that in Perl 5.6.x, the simple approach worked, but in 5.8.x, it takes forever. Note: the following code is slightly simplified for clarity. The sequence file looks like this, generally with some fixed number of letters per line:

```
>sequence1_name
ATGACTGTTGG...etc.
```
```perl
my $seq_name;
my %seqs_hash;
while (my $line = <SEQS>) {
    chomp $line;
    if ($line =~ m/\>\s*(.+)$/) {        # header line: capture the sequence name
        $seq_name = $1;
    }
    else {
        $line =~ s/\s//g;                # strip any whitespace from a sequence line
        $seqs_hash{$seq_name} .= $line;  # append to the current sequence
    }
}
```

In Perl 5.6.x, this would take approximately 30 seconds - very acceptable. However, in Perl 5.8.4, this is unbelievably slow, and a simple timing analysis shows that the line causing the problem is `$seqs_hash{$seq_name} .= $line;`. It gets progressively slower: the first 30,000 lines of the sequence file go fairly fast, but after that it gets slower and slower. I guess what's happening is that more and more memory is being used because new copies of the growing string keep being made? Any suggestions as to how to recode the above? I tried the next simplest approach:
```perl
my $seq_name;
my %seqs_hash;
my %temp_hash;
while (my $line = <SEQS>) {
    chomp $line;
    if ($line =~ m/\>\s*(.+)$/) {
        $seq_name = $1;
    }
    else {
        $line =~ s/\s//g;
        push(@{$temp_hash{$seq_name}}, $line);   # collect the lines for each sequence
    }
}
foreach my $key (keys %temp_hash) {
    $seqs_hash{$key} = join("", @{$temp_hash{$key}});   # concatenate once per sequence
}
```

But this still takes approximately 5 minutes - still much slower than Perl 5.6.x. Is it easier to just go back to Perl 5.6, or is there a more elegant way to code this algorithm, perhaps involving slurping the entire file and doing something with regular expressions?

Thanks much,
Bob
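For the slurp-and-regex idea floated above, here is a minimal sketch, assuming the same one-header-then-sequence-lines layout shown at the top. The filename `seqs.fa`, the record-separator trick (`local $/ = "\n>"`), and the variable names are illustrative choices, not something from the original post.

```perl
use strict;
use warnings;

my %seqs_hash;

# Hypothetical filename; substitute whatever path or handle you already use.
open my $fh, '<', 'seqs.fa' or die "Can't open seqs.fa: $!";
{
    local $/ = "\n>";                      # read one ">name\nACGT..." record per loop
    while (my $record = <$fh>) {
        chomp $record;                     # removes the trailing "\n>" separator
        $record =~ s/^>//;                 # the first record still carries its leading ">"
        my ($name, @lines) = split /\n/, $record;
        next unless defined $name && length $name;
        $name =~ s/^\s+|\s+$//g;           # trim the sequence name
        my $seq = join '', @lines;
        $seq =~ s/\s//g;                   # mirror the whitespace stripping in the loops above
        $seqs_hash{$name} = $seq;
    }
}
close $fh;
```

The point of the record separator is that each readline hands back a whole sequence at once, so the hash value is built with a single `join` per sequence instead of one append per line.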
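Another hedged sketch, aimed directly at the progressive slowdown described above: if the cost really is in repeatedly appending to a hash element so that the growing string keeps getting copied, one variant is to accumulate each sequence in a plain lexical scalar and store it in the hash only when the next header (or end of file) arrives. The `SEQS` handle matches the loops above; the filename is hypothetical, and whether this actually helps on a given 5.8.x build is an open question, not a claim.

```perl
use strict;
use warnings;

my %seqs_hash;
my $seq_name;
my $seq = '';                                # current sequence accumulates here

open SEQS, '<', 'seqs.fa' or die "Can't open seqs.fa: $!";   # hypothetical filename
while (my $line = <SEQS>) {
    chomp $line;
    if ($line =~ m/^>\s*(.+)$/) {
        # store the finished sequence before starting the next one
        $seqs_hash{$seq_name} = $seq if defined $seq_name;
        ($seq_name, $seq) = ($1, '');
    }
    else {
        $line =~ s/\s//g;
        $seq .= $line;                       # append to a lexical scalar, not a hash element
    }
}
$seqs_hash{$seq_name} = $seq if defined $seq_name;   # don't drop the last sequence
close SEQS;
```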