I thought this would be incredibly easy, but perhaps it's not. I analyze FASTA-formatted DNA sequences; in this case, the 120 MB Drosophila (fruit fly) genome. The details of the file are not particularly important, but for anyone unfamiliar with the FASTA format, the basic structure is:

>sequence1_name
ATGACTGTTGG...etc.
generally some fixed number of letters per line

What I need to do is read the file and store each sequence in a hash, where the key is the sequence name and the value is the sequence itself. My problem is that the simple approach worked in Perl 5.6.x, but in 5.8.x it takes forever. Note: the following code is slightly simplified for clarity.

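my $seq_name;
my %seqs_hash;
# SEQS is a filehandle already opened on the FASTA file
while (my $line = <SEQS>) {
  chomp $line;
  if ($line =~ m/^>\s*(.+)$/) {
    # header line: remember the name of the current sequence
    $seq_name = $1;
  } else {
    # sequence line: strip whitespace and append to the current sequence
    $line =~ s/\s//g;
    $seqs_hash{$seq_name} .= $line;
  }
}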

In Perl 5.6.x, this took approximately 30 seconds, which is very acceptable. In Perl 5.8.4, however, it is unbelievably slow, and a simple timing analysis shows that the line causing the problem is $seqs_hash{$seq_name} .= $line; It gets progressively slower: the first 30,000 lines of the sequence file go fairly fast, but after that each append takes longer and longer. My guess is that the growing string is being copied again and again, so more and more memory is being moved each time? Any suggestions on how to recode the above? I tried the next simplest approach:

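my $seq_name;
my %seqs_hash;
my %temp_hash;
# SEQS is a filehandle already opened on the FASTA file
while (my $line = <SEQS>) {
  chomp $line;
  if ($line =~ m/^>\s*(.+)$/) {
    # header line: remember the name of the current sequence
    $seq_name = $1;
  } else {
    # sequence line: collect the pieces in an array instead of appending
    $line =~ s/\s//g;
    push(@{$temp_hash{$seq_name}}, $line);
  }
}

# build each sequence with a single join once all lines have been read
foreach my $key (keys %temp_hash) {
  $seqs_hash{$key} = join("", @{$temp_hash{$key}});
}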

But this still takes approximately 5 minutes, which is much slower than Perl 5.6.x. Is it easier to just go back to Perl 5.6, or is there a more elegant way to code this, perhaps by slurping the entire file and doing something with regular expressions?

Thanks much,
Bob
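One possible middle ground between line-by-line appends and slurping the whole file is to read one FASTA record per read by setting the input record separator $/ to "\n>". The following is only a minimal sketch under that assumption: read_fasta is a hypothetical helper name, the sample data is made up, and whether this avoids the slowdown on 5.8.x has not been measured here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads FASTA records
# from an already-open filehandle and returns a hash ref of name => sequence.
sub read_fasta {
    my ($fh) = @_;
    my %seqs;
    local $/ = "\n>";    # each read now returns one whole record
    while (my $record = <$fh>) {
        chomp $record;              # drops the trailing "\n>" when present
        $record =~ s/^>\s*//;       # only the first record keeps its leading ">"
        my ($name, $body) = split /\n/, $record, 2;
        next unless defined $name && length $name;
        $body = '' unless defined $body;
        $name =~ s/\s+$//;          # trim trailing whitespace such as "\r"
        $body =~ tr/ \t\r\n//d;     # remove all whitespace in a single pass
        $seqs{$name} = $body;       # one assignment per sequence, no appends
    }
    return \%seqs;
}

# Small in-memory demonstration; a real run would open the genome file instead.
my $sample = ">seq1\nATGA\nCTGT\n>seq2\nGGCC\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $seqs = read_fasta($fh);
close $fh;
print "$_ => $seqs->{$_}\n" for sort keys %$seqs;
```

Each sequence is assigned to the hash once, so the string is never grown by repeated .= operations. The trade-off is that a whole record must fit in memory at once, which it already must for the final hash.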

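To check whether the append itself is what slows down, the timing analysis can be reproduced without the genome file. This is a rough sketch, not the measurement used above: time_appends is a hypothetical helper, and the chunk size of 60 letters and the batch sizes are arbitrary assumptions. If the per-batch times climb steadily on one Perl build and stay flat on another, that points at string growth and not at file reading.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: appends $chunk to a hash value $count times and
# records the elapsed wall-clock seconds for every $batch appends.
sub time_appends {
    my ($chunk, $count, $batch) = @_;
    my %h;
    my @elapsed;
    my $start = time;
    for my $i (1 .. $count) {
        $h{seq} .= $chunk;              # same operation as in the slow loop
        if ($i % $batch == 0) {
            my $now = time;
            push @elapsed, $now - $start;
            $start = $now;
        }
    }
    return (\@elapsed, length $h{seq});
}

# Assumed sizes: 100,000 lines of 60 letters, reported in batches of 10,000.
my ($elapsed, $len) = time_appends('ACGT' x 15, 100_000, 10_000);
printf "batch %2d: %.4f s\n", $_ + 1, $elapsed->[$_] for 0 .. $#$elapsed;
print "final length: $len\n";
```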
In reply to Creating very long strings from text files (DNA sequences) by bobychan
