Re^3: Creating very long strings from text files (DNA sequences)

I tried the code from your original post on my Linux box using perl 5.6.1 and 5.8.0. For the same fruitfly file that BrowserUK mentioned, the times were 7.9 seconds and 21.2 seconds respectively. So I am seeing a perl 5.8 slowdown of about 2.7 times, not the 10 times that you are seeing. However, my 5.8.0 was built with a later C compiler and with more optimization, so the two builds are not strictly comparable. My machine is a 2.5GHz Pentium.

It would be interesting to verify whether ActiveState perl 5.8 really is that much slower than 5.6 on the exact same machine and configuration.

Perhaps someone knows of a perl for windows that is faster than the one from ActiveState?

UPDATE:
Here is some code that is a bit faster: about 13 seconds in perl 5.8 on my machine, and 5 seconds in perl 5.6. I think this code works, but it will need more testing to be sure. The basic idea is to work on the whole sequence at once when possible, for example when removing whitespace. Also, don't use the hash for intermediate results. Instead, build the sequence in a scratch variable and store it in the hash only when it is complete.

Doing this makes the logic a bit twisted, but you can probably straighten it out with some more thought.

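while (<SEQS>) {
  chomp;
  if (/^>\s*(.+)$/) {          # header line: ">name"
    # Store the previous record, stripping whitespace from the whole sequence at once
    if ($seq_name ne '') {
      $seqs =~ s/\s//g;
      $seqs_hash{$seq_name} = $seqs;
    }
    $seqs = '';                # reset here so text before the first header is dropped
    $seq_name = $1;
  } else {
    $seqs .= $_;               # accumulate in the scratch variable, not in the hash
  }
}

# Store the final record, which has no following header to trigger it
$seqs =~ s/\s//g;
$seqs_hash{$seq_name} = $seqs if ($seq_name ne '');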
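
For anyone who wants to try the idea standalone, here is a rough sketch of the same approach wrapped in a sub. The name read_fasta and the in-memory test data are my own inventions for illustration, not part of the original code. I also swapped s/\s//g for tr///d, which is usually cheaper for deleting single characters, though that is worth timing on your own perl. The in-memory filehandle needs perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read FASTA records
# from an open filehandle and return a hash ref of header => sequence.
sub read_fasta {
    my ($fh) = @_;
    my %seqs_hash;
    my $seq_name = '';
    my $seqs     = '';    # scratch variable for the record being built

    while (my $line = <$fh>) {
        if ($line =~ /^>\s*(.+?)\s*$/) {
            # New header: finish the previous record before starting the next
            if ($seq_name ne '') {
                $seqs =~ tr/ \t\r\n\f//d;    # strip whitespace once per record
                $seqs_hash{$seq_name} = $seqs;
            }
            $seq_name = $1;
            $seqs     = '';
        } else {
            $seqs .= $line;                  # no per-line chomp or cleanup
        }
    }

    # The last record has no following header, so store it here
    if ($seq_name ne '') {
        $seqs =~ tr/ \t\r\n\f//d;
        $seqs_hash{$seq_name} = $seqs;
    }
    return \%seqs_hash;
}

# Usage with made-up data in an in-memory filehandle
my $fasta = ">seq1 test\nACGT ACGT\nTTGA\n>seq2\nGG CC\n\nAA\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";
my $seqs = read_fasta($fh);
close($fh);
print "$_ => $seqs->{$_}\n" for sort keys %$seqs;
```

Returning a hash reference avoids copying the long strings a second time, which matters when each sequence is many megabytes.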

It should work perfectly the first time! - toma
