bobychan has asked for the wisdom of the Perl Monks concerning the following question:

I thought this was incredibly easy, but perhaps it's not. I analyze FASTA-formatted DNA sequences - in this case, the 120 MB Drosophila (fruit fly) genome. The details of the file are not particularly important, but for people who are unfamiliar with the FASTA format, the basic structure is like so:
>sequence1_name
ATGACTGTTGG...etc. (generally some fixed number of letters per line)
What I need to do is read the file in and stuff each sequence into a hash where the key is the sequence name and the value is the sequence itself. My problem is that in Perl 5.6.x, the simple approach worked, but in 5.8.x, it takes forever. Note: The following code is slightly simplified for clarity
my $seq_name;
my %seqs_hash;
while (my $line = <SEQS>) {
    chomp $line;
    if ($line =~ m/\>\s*(.+)$/) {
        $seq_name = $1;
    }
    else {
        $line =~ s/\s//g;
        $seqs_hash{$seq_name} .= $line;
    }
}
In Perl 5.6.x, this would take approximately 30 seconds - very acceptable. However, in Perl 5.8.4, this is unbelievably slow and a simple timing analysis shows that the line that's causing the problem is the $seqs_hash{$seq_name} .= $line; It gets progressively slower so that for the first 30,000 lines of the sequence file, it's fairly fast, but then it gets slower and slower and slower. I guess what's happening is that more and more memory is being used because new copies of the growing string keep being copied? Any suggestions as to how to recode the above code? I tried the next simplest approach:
my $seq_name;
my %seqs_hash;
my %temp_hash;
while (my $line = <SEQS>) {
    chomp $line;
    if ($line =~ m/\>\s*(.+)$/) {
        $seq_name = $1;
    }
    else {
        $line =~ s/\s//g;
        push(@{$temp_hash{$seq_name}}, $line);
    }
}
foreach my $key (keys %temp_hash) {
    $seqs_hash{$key} = join("", @{$temp_hash{$key}});
}
But this still takes approximately 5 minutes - still much slower than Perl 5.6.x. Is it easier to just go back to Perl 5.6 or is there a more elegant way to code this algorithm, perhaps involving slurping the entire file and doing something with regular expressions? Thanks much, Bob

Replies are listed 'Best First'.
Re: Creating very long strings from text files (DNA sequences)
by stajich (Chaplain) on Jun 26, 2004 at 02:34 UTC
    You might also try Lincoln Stein's Bio::DB::Fasta as it will allow you to index a bunch of fasta files in a whole directory and treat it like a tied hash (among other things).
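    For reference, a minimal sketch of how Bio::DB::Fasta is typically used (the directory path and sequence name here are made-up placeholders; the module builds a persistent index the first time and reuses it on later runs):

```perl
use Bio::DB::Fasta;

# Index every FASTA file in a directory (hypothetical path).
my $db = Bio::DB::Fasta->new('/path/to/fasta/dir');

# Fetch a whole sequence as a string, or just a slice of it,
# without loading the rest of the file into memory.
my $seq = $db->seq('sequence1_name');
my $sub = $db->seq('sequence1_name', 1 => 50);
```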
Re: Creating very long strings from text files (DNA sequences)
by etcshadow (Priest) on Jun 26, 2004 at 02:57 UTC
    You could try just setting $/ to "\n>". That is, set the "input record separator" to what actually separates your input records. Gotta love perl for its ability not to answer the question, but rather to very simply change the question to a different one (to which the answer is trivial).

    This would look something like this:

    local $/ = "\n>";
    my %seqs_hash;
    while (my $line = <SEQS>) {
        chomp $line;
        $line =~ s/^>?\s*(.+?)\r?\n//
            or do { warn "Malformatted sequence\n"; next; };
        my $seq_name = $1;
        $line =~ s/\s//gs;
        $seqs_hash{$seq_name} = $line;
    }
    So you're only doing one read per sequence (so there's no need for string concatenation at all).
    ------------ :Wq Not an editor command: Wq
      I tried the suggestions of Anonymous Monk and etcshadow. Unfortunately...

      • .= in perl 5.6: 30 seconds
      • store in array, join into string: 5 minutes
      • set $/ to "\n>": 4 minutes
      Sorry about the typo, etcshadow!

      I glanced through the perl58delta, but did not see any architectural changes to memory allocation. Could there be something else causing the slowdown?

      At this point, I've become interested in this issue above and beyond just solving my particular sequence-reading problem. Is this a bug or a feature of perl 5.8...?

        set $/ to "\n": 4 minutes
        Was that a typo? Did you mean: set $/ to "\n>" (with a greater-than sign)? Because if you set $/ to just "\n", then you haven't actually changed the important thing.

        Of course, if that's just a typo, then I could see that my "solution" (which I didn't have any data to benchmark with... or a perl 5.8 install to benchmark against) doesn't help significantly... and that's entirely possible, but I'm curious, nonetheless.

        ------------ :Wq Not an editor command: Wq
Re: Creating very long strings from text files (DNA sequences)
by BrowserUk (Patriarch) on Jun 26, 2004 at 04:24 UTC

    I think that the advice to look at using one of the modules available for dealing with FASTA files (which probably use XS or C) is your best bet, but if you need/want to use perl code for this, then there are a few things that can speed up the task a little.

    1. Pre-allocate the buffer that you use to accumulate the sequence into.

      Pre-extending a scalar, and then setting it back to the null string ('' not undef), apparently prevents perl from having to allocate and then reallocate bigger and bigger chunks of memory each time as it accumulates the sequence with  .=.

    2. Use a scalar that is allocated outside the loop where the accumulation is done, and then assign the result to the hash.

      This ensures that the preallocated space is re-used each time through and saves having to re-preallocate a new one each time.

    3. Use a lexical variable, rather than a package global.

      It's a small difference but they are faster.

    4. Pre-allocate the hash buckets using keys( %hash ) = nnnn;.

      This can save up to 10% depending on the size of the hash (and maybe the values of the keys?).

      Unfortunately, I can't recommend a value for nnnn above; it isn't clear to me what relationship, if any, there is between the number of keys the hash will hold and the number of buckets it will ultimately use. I think this depends upon the values of the keys, but maybe someone will put me straight on that.

      It never gets any quicker once the number of buckets (always a power of 2) exceeds the number of keys. My experiments seem to show that it is usually possible to reduce the number of buckets below the number of keys (which saves some memory) without losing the performance gain. Sometimes, it even improves it.

    Note: These are only observations, not guarantees. YMMV. Try running the benchmark below with parameters chosen to match your requirements, then vary things a bit to see what works best for you.
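    A self-contained sketch combining tips 1, 2 and 4 above (the buffer and bucket sizes are placeholder guesses you would tune for your own data; the in-memory FASTA text is just for demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny in-memory stand-in for the real sequence file.
my $fasta = ">seq1 test\nACGT\nACGT\n>seq2 test\nTTTT\n";
open my $fh, '<', \$fasta or die $!;

my %seqs_hash;
keys(%seqs_hash) = 1024;        # tip 4: pre-allocate hash buckets

my $buf = ' ' x 1_000_000;      # tip 1: pre-extend the accumulator...
$buf = '';                      # ...then reset it to '' (not undef)

my $seq_name;                   # tip 2: accumulate outside the hash
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^>\s*(.+)$/ ) {
        $seqs_hash{$seq_name} = $buf if defined $seq_name;
        $buf      = '';         # keep the allocation, drop the contents
        $seq_name = $1;
    }
    else {
        $line =~ tr/ \t\r//d;   # strip whitespace without a regex
        $buf .= $line;
    }
}
$seqs_hash{$seq_name} = $buf if defined $seq_name;

print "$seqs_hash{'seq1 test'}\n";   # ACGTACGT
```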


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Creating very long strings from text files (DNA sequences)
by hv (Prior) on Jun 26, 2004 at 01:39 UTC

    I'm surprised to see so great a slowdown from 5.6.x to 5.8.x, but perhaps "what's faster than .=" can be of some help in starting to get a handle on the problem.

    Hugo

Re: Creating very long strings from text files (DNA sequences)
by Anonymous Monk on Jun 26, 2004 at 01:50 UTC
    If your data file is close to the one below, this code should do it. Notice that the first sequence line contains some spaces, unlike all the following lines, to make sure the code handles what I believe is the form your data is in.

    Hope this helps

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my %seq_hash;
    $/ = '>';
    while ( <DATA> ) {
        if (s/^(\w+)\s\S+.*\n//) {
            chomp;
            tr/\n\t //d;
            $seq_hash{$1} = $_;
        }
    }
    print Dumper(\%seq_hash);

    __DATA__
    >YNL331C sequence of the region upstream from ynl331c
    CAATATG CGAG GGAC CTACATGTTGAGCATGACAATGAATTCTATTGAAAAAAAGAGTTG
    GAAGTATATACGAATATAAATAATGTGAAACAAAAGAAGAAAAGTGAATAAAAGGCACTT
    AAGACGCTATCCAATTGTGTATGAGAAGTGCAAACTCAATTTTTTTGCAAAAGACTTTTC
    TCAAACCTTTAATTGATGCGTTCCGTTCATACATCAACCGCGTCCCATACGTTTCGAAGA
    AAATTGCCTATATGTGTTATTTACATACTAGAGATATTTTAATATTGAAACAGTTGTATT
    TCTATTGTAATTACTATCTAAACGTCTCGTCCCACCGCTGTTGATAAAGCGGTGCCAATA
    TAATTTATAATCACGCCTGGGTAAATGACTTTTAATTTCTTAAATCATCGAAGTATGCGA
    AAACAAGAAGTCTTTATTCATAATAAAAAACAAATTCGGTTACTACGACTTTTATATGTC
    ATTCAATATTTGGTAATAATTTTACAGTATTAACCCTATCCTCATCTGATTCACTCTCTT
    CTAATTGCATATATTTTCAAATTCGCTTTCAGCGCTGTACAAAACCAGTAAGCAGATCTC
    GTACGAAGACTCAAATAAGTTGCATTGTTCGTATTCAAGGAAACCGGGGGGCAAAATTTC
    CAACCATATTTAAGTATGACAATATTTCCAAGTCAAGGATGCATGCTGTTCTTCTCTTCA
    TTAACTAGCTAACCAATTAGCTGAACGGCTTTGTATTTTACTTAACATATTGTCTATTGC
    ATAAAAAACCACTATTCAGC
    >YKL071W sequence of the region upstream from ykl071w
    ATAATTATTCCTGTTTCTTTAACCTGGTAAAAAAAAGTACAAACACTTAAGCTTTTTGAA
    ACAGCTTTATTTTGCTTCATTAAATAGCTAGGATAAGAAATCCCTCATCCGAAAGGTTTT
    GTATCTAACTACCCTAGAGAACATTTGTCCTGATCAGGTTCATTTGGAGTTTATATTTTT
    TAGAAGCTCAAAGTTTGTTGGACTCATTACCATGGAAGAAAAAAAGAAGATACTACGAAA
    TATTGGTTTCTCAGGTTAAATAAGGGACACCATTTTCCTATTAGGCTAGTCGAGCTTAGT
    TCTTCTAATTTCTTCAGATCTTCTATAATTTCCTATCTTCTACCTGATGTGTGCATGATA
    TATCTATGAGCTCCTGATATTGCTTGTTTTACTTTAGCTTGCATGACTTGCAATAATCTA
    ATCATATATGTTCCCGATTAATATACTGTGCACAAATTGCAGGACATATAATTTTTCCGT
    GGATTATATCTTCGATTAACGTCCGCGGGTCTCATAAAAAGCAAACCAACTTCGCAATTC
    CCTAGAAATACCTCAATAGAAAGTTATTTGTAATGAGATTAGTAATGAGATTAGCAATGA
    GATTAGTAATGAGATTAGTAATGAGATTAGTAATGTGATTAGTAATGCATAGCGGTATAA
    ATGGTAGTACTAATAAGTAAGATAGTATACCAGTTATAATAAATAGGCGGCGATGCTTCA
    AAACTAATTTTTGACGTTTTTAAGAATAAAGCCTTTACCAGTGGCATAAATCAGTAGAAT
    TCTAAGCAAACAAAGTCGAT
Re: Creating very long strings from text files (DNA sequences)
by BrowserUk (Patriarch) on Jun 26, 2004 at 12:51 UTC

    Not sure if this is useful or not, but it does load this 165MB FASTA file in 1.156 seconds and 165MB of ram using 5.8.3. The iteration performance isn't too bad either.
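    BrowserUk's code is not reproduced in this node. As a guess at the slurp-based approach the original question hinted at, something along these lines reads the file in one go and splits it into records (the in-memory FASTA text is a stand-in for the real file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny in-memory stand-in for the real sequence file.
my $fasta = ">seq1 test\nACGT\nACGT\n>seq2 test\nTTTT\n";
open my $fh, '<', \$fasta or die $!;

my %seqs_hash;
{
    local $/;                       # slurp the whole file at once
    my $data = <$fh>;
    for my $chunk ( split /^>/m, $data ) {
        next unless length $chunk;  # skip the empty leading field
        my ( $header, $seq ) = split /\n/, $chunk, 2;
        $seq =~ tr/ \t\r\n//d;      # strip all whitespace in one pass
        $seqs_hash{$header} = $seq;
    }
}
print "$seqs_hash{'seq2 test'}\n";  # TTTT
```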


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Creating very long strings from text files (DNA sequences)
by toma (Vicar) on Jun 26, 2004 at 19:05 UTC
    Perhaps your builds of perl 5.6 and 5.8 were done with different compilers. Look at the output of perl -V from both versions of perl and look for differences.

    If you are not able to build perl, you can probably get someone to try your exact code and data on another platform, such as linux. Devel::SmallProf can help track down the differences.

    You may also be running into some sort of system problem such as disk fragmentation.

    It should work perfectly the first time! - toma
      Hi Toma, I'm using ActivePerl on a Windows XP desktop (Athlon AMD 2500+, 7200 RPM 60 GB hard drive, 512 MB PC2700 RAM), which is the one taking a long time with 5.8, versus a Windows XP laptop (Athlon AMD Mobile 2500+, 4200 RPM 20 GB hard drive, 768 MB PC2100) for 5.6.
      This is perl, v5.6.1 built for MSWin32-x86-multi-thread (Binary build 635 provided by ActiveState Corp., built 15:34:21 Feb 4 2003)
      This is perl, v5.8.4 built for MSWin32-x86-multi-thread (Binary build 810 provided by ActiveState Corp., built Jun 1 2004 11:52:21)

      perl -V shows that optimization flags are the same for both binary builds...

        I tried the code in your original post on my linux box using perl 5.6.1 and 5.8.0. For the same fruitfly file that BrowserUK mentioned, the times were 7.9 seconds and 21.2 seconds. So I am seeing a perl 5.8 slowdown of about 2.5 times, not the 10 times that you are seeing. However, my 5.8.0 version was built with a later C compiler and uses more optimization. My machine is a 2.5GHz Pentium.

        It would be interesting to verify that ActiveState perl is so much slower in 5.8 on the exact same machine and configuration.

        Perhaps someone knows of a perl for windows that is faster than the one from ActiveState?

        UPDATE:
        Here is some code that is a bit faster: about 13 seconds in perl 5.8 on my machine, and 5 seconds in perl 5.6. I think this code works, but it will need more testing to be sure. The basic idea is to work on the whole sequence when possible, such as when removing whitespace. Also, don't use the hash for intermediate results. Instead, use a scratch variable and store it in the hash when the result is complete.

        Doing this makes the logic a bit twisted, but you can probably straighten it out with some more thought.

        while (<SEQS>) {
            chomp;
            if (/\>\s*(.+)$/) {
                if ($seq_name ne '') {
                    $seqs =~ s/\s//g;
                    $seqs_hash{$seq_name} = $seqs;
                    $seqs = '';
                }
                $seq_name = $1;
            }
            else {
                $seqs .= $_;
            }
        }
        $seqs =~ s/\s//g;
        $seqs_hash{$seq_name} = $seqs if ($seq_name ne '');
        It should work perfectly the first time! - toma

        Perl 5.8.x is slower than perl 5.6.x at string manipulation.

        That's old news. If you only wish to complain about that, fair enough, but it is unlikely to change anything.

        You do have the option to follow the 5.6.2 upgrade path, provided you can live without the new functionality that is in 5.8.x. But if you need that functionality, then you have to appreciate that it comes with a cost.

        There are at least two replies in this thread that offer practical steps for mitigating your problem.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon