in reply to Creating very long strings from text files (DNA sequences)

I think that the advice to look at one of the modules available for dealing with FASTA files (which probably use XS or C under the hood) is your best bet, but if you need or want to do this in pure Perl, there are a few things that can speed the task up a little.
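
For the module route, here is a minimal sketch using BioPerl's Bio::SeqIO (assuming BioPerl is installed; the input file name is made up):

    #! perl -slw
    use strict;

    use Bio::SeqIO;

    ## 'sequences.fasta' is a hypothetical input file
    my $in = Bio::SeqIO->new( -file => 'sequences.fasta', -format => 'fasta' );

    my %sequence;
    while( my $seq = $in->next_seq ) {
        ## display_id() is the part of the FASTA header up to the first whitespace
        $sequence{ $seq->display_id } = $seq->seq;
    }

    print scalar( keys %sequence ), ' sequences loaded';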

  1. Pre-allocate the buffer that you use to accumulate the sequence into.

    Pre-extending a scalar, and then setting it back to the null string ('', not undef), apparently prevents perl from having to allocate and then reallocate bigger and bigger chunks of memory as it accumulates the sequence with .= (items 1 to 3 are pulled together in the first sketch after this list).

  2. Use a scalar that is allocated outside the loop where the accumulation is done, and then assign the result to the hash.

    This ensures that the preallocated space is re-used each time through the loop, and saves having to pre-allocate a new scalar for every key.

  3. Use a lexical variable for the buffer, rather than a package global.

    It's a small difference, but lexicals are slightly faster.

  4. Pre-allocate the hash buckets using keys %hash = nnnn;.

    This can save up to 10%, depending on the size of the hash (and maybe the values of the keys?).

    Unfortunately, I can't recommend a value for nnnn above; it isn't clear to me what relationship, if any, there is between the number of keys the hash will hold and the number of buckets it will ultimately use. I think this depends upon the values of the keys, but maybe someone will put me straight on that.

    It never gets any quicker once the number of buckets allocated (always a power of 2) exceeds the number of keys. My experiments seem to show that it is usually possible to reduce the number of buckets below the number of keys (which saves some memory) without reducing the performance gain; sometimes it even improves it. (The second sketch after this list shows how to inspect the bucket count.)
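
Pulling items 1 to 3 together for the FASTA-reading task, a rough sketch (the parsing is deliberately simple, and $MAXLEN is an assumed upper bound on sequence length):

    #! perl -slw
    use strict;

    my $MAXLEN = 10_000_000;        ## assumed upper bound on a sequence length

    my %seq;

    my $buffer = ' ' x $MAXLEN;     ## 1. pre-extend a lexical (3.) buffer, declared
    $buffer    = '';                ##    outside the loop (2.), then empty it

    my $id;
    while( <> ) {
        chomp;
        if( /^>(\S+)/ ) {           ## a new FASTA header line
            $seq{ $id } = $buffer if defined $id;   ## copy the finished sequence
            $id     = $1;
            $buffer = '';           ## emptying keeps the allocation for re-use
        }
        else {
            $buffer .= $_;          ## accumulate into the pre-extended buffer
        }
    }
    $seq{ $id } = $buffer if defined $id;           ## don't forget the last record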
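
And a sketch of item 4. On older perls, a hash evaluated in scalar context reports used/allocated buckets, which lets you see what a given nnnn buys you (16_384 here is only an example value; on 5.26 and later, scalar %hash just returns the key count):

    #! perl -slw
    use strict;

    my %hash;
    keys %hash = 16_384;    ## request at least 16384 buckets up front;
                            ## perl rounds the request up to a power of two

    $hash{ $_ } = 1 for 1 .. 10_000;

    ## Pre-5.26 this prints something like "6885/16384" (used/allocated buckets);
    ## 5.26+ prints the number of keys instead.
    print scalar %hash;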

Note: These are only observations, not guarantees. YMMV. Try running the benchmark below with parameters chosen to match your requirements, and then vary things a bit to see what works best for you.

#! perl -slw
use strict;

use Benchmark qw[ cmpthese ];

our $K     ||= 8;
our $N     ||= 1000;
our $M     ||= 100;
our $CHUNK ||= 80;

our $chunk  = ' ' x $CHUNK;
our $BUFFER = ' ' x ( $M * $CHUNK );

cmpthese( -10, {
    std => q[
        my %hash;
        for my $key ( 1 .. $N ) {
            $hash{ $key } = '';
            for my $chunk ( 1 .. $M ) {
                $hash{ $key } .= $chunk;
            }
        }
    ],
    prealloc => q[
        my %hash;
        for my $key ( 1 .. $N ) {
            $hash{ $key } = ' ' x ( $M * $CHUNK );
            $hash{ $key } = '';
            for my $chunk ( 1 .. $M ) {
                $hash{ $key } .= $chunk;
            }
        }
    ],
    gbuf => q[
        my %hash;
        for my $key ( 1 .. $N ) {
            $BUFFER = '';
            for my $chunk ( 1 .. $M ) {
                $BUFFER .= $chunk;
            }
            $hash{ $key } = $BUFFER;
        }
    ],
    lbuf => q[
        my %hash;
        my $buffer = ' ' x ( $M * $CHUNK );
        for my $key ( 1 .. $N ) {
            $buffer = '';
            for my $chunk ( 1 .. $M ) {
                $buffer .= $chunk;
            }
            $hash{ $key } = $buffer;
        }
    ],
    gbuf_keys => q[
        my %hash;
        keys %hash = $K;
        for my $key ( 1 .. $N ) {
            $BUFFER = '';
            for my $chunk ( 1 .. $M ) {
                $BUFFER .= $chunk;
            }
            $hash{ $key } = $BUFFER;
        }
    ],
    lbuf_keys => q[
        my %hash;
        keys %hash = $K;
        for my $key ( 1 .. $N ) {
            $buffer = '';
            for my $chunk ( 1 .. $M ) {
                $buffer .= $chunk;
            }
            $hash{ $key } = $buffer;
        }
    ],
});

__END__
P:\test>369770 -N=10000 -K=16384
             Rate  std prealloc lbuf gbuf lbuf_keys gbuf_keys
std       0.858/s   --     -16% -56% -56%      -57%      -57%
prealloc   1.03/s  20%       -- -48% -48%      -48%      -49%
lbuf       1.96/s 128%      91%   --  -0%       -1%       -2%
gbuf       1.96/s 129%      91%   0%   --       -1%       -2%
lbuf_keys  1.98/s 131%      93%   1%   1%        --       -1%
gbuf_keys  2.00/s 133%      95%   2%   2%        1%        --

P:\test>369770 -N=20000 -K=16384
          s/iter  std prealloc gbuf_keys lbuf_keys lbuf gbuf
std         2.36   --     -15%      -50%      -50% -53% -53%
prealloc    2.00  18%       --      -41%      -42% -44% -45%
gbuf_keys   1.18 100%      70%        --       -1%  -6%  -6%
lbuf_keys   1.17 102%      71%        1%        --  -5%  -6%
lbuf        1.11 112%      80%        6%        5%   --  -1%
gbuf        1.10 114%      81%        7%        6%   1%   --

P:\test>369770 -N=20000 -K=32678
          s/iter  std prealloc lbuf gbuf lbuf_keys gbuf_keys
std         2.36   --     -14% -53% -54%      -54%      -54%
prealloc    2.02  17%       -- -45% -46%      -46%      -46%
lbuf        1.11 111%      81%   --  -2%       -2%       -3%
gbuf        1.10 115%      84%   2%   --       -0%       -1%
lbuf_keys   1.09 116%      85%   2%   0%        --       -1%
gbuf_keys   1.08 118%      86%   3%   1%        1%        --

P:\test>369770 -N=10000 -K=16384 -M=200
             Rate  std prealloc lbuf lbuf_keys gbuf_keys gbuf
std       0.435/s   --     -25% -60%      -60%      -60% -61%
prealloc  0.578/s  33%       -- -47%      -47%      -47% -48%
lbuf       1.09/s 152%      89%   --       -0%       -0%  -1%
lbuf_keys  1.10/s 152%      90%   0%        --       -0%  -1%
gbuf_keys  1.10/s 153%      90%   0%        0%        --  -1%
gbuf       1.11/s 154%      91%   1%        1%        1%   --

P:\test>369770 -N=10000 -K=16384 -M=300
(warning: too few iterations for a reliable count)
          s/iter  std prealloc gbuf lbuf_keys gbuf_keys lbuf
std         3.61   --     -28% -63%      -64%      -64% -64%
prealloc    2.59  40%       -- -49%      -49%      -50% -50%
gbuf        1.32 173%      96%   --       -1%       -1%  -2%
lbuf_keys   1.31 175%      97%   1%        --       -1%  -2%
gbuf_keys   1.30 177%      98%   1%        1%        --  -1%
lbuf        1.29 179%     100%   2%        2%        1%   --

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon