Executive summary: 20x speed-up for 200k bp

hdb's solution fixes the the presumed Out of memory! error you were seeing (it kicks in on my desktop somewhere around 50k), but dragging around the whole suffix when you don't need nearly that much string to solve your question is a little silly. Rather than that, you could try grabbing say 10 characters at a time, and then only grabbing more if you need it.

sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
Second, doing 2 calls to substr on each cmp is excessive. You'll run out of memory if you grab the whole suffix (O(n^2) memory usage), but if you cache the bit you need, that should drop to O(m*n). (Found a bug that made caching unreliable; fixing it ruined its performance)

Finally, I note the phrase "bp" in the post, and note that the demo set contains 2 characters; my guess is that this is a bioinformatics question, and checking w/ 4 letters would, if anything, hurt my code. I've therefore also benchmarked w/ rand(4) instead of rand(26). Long story short: 2107% speed-up over hdb's solution. Test script:

use strict; use warnings; use Benchmark ':all'; verify(); comparison(20000,26,20); comparison(20000,4,20); comparison(200000,26,5); comparison(200000,4,5); sub verify { my $input = join '', map { ('a'..'z')[rand(4)] } 0..20000-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); die "Mismatch 1" if @RobertCraven != @hdb; die "Mismatch 2" if @RobertCraven != @kennethk; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($i)" if $RobertCraven[$i] ne $hdb[$i]; die "Char mismatch 2($i)" if $RobertCraven[$i] ne $kennethk[$i +]; } } sub comparison { my ($n, $letters, $count) = @_; printf "%10d %2d %2d\n", @_; my $input = join '', map { ('a'..'z')[rand($letters)] } 0..$n-1; cmpthese($count, { $n < 25000 ? ( 'RobertCraven'=> sub{RobertCraven($input)}, ) : (), 'hdb' => sub{hdb($input)}, 'kennethk' => sub{kennethk($input)}, }); print "\n"; } sub RobertCraven { my $string = shift; my $length = length($string); my @ordered_list = sort map { substr($string, $_) } 0 .. $length - + 1; my @indices = map $length - length($_) + 1, @ordered_list; return @indices; } sub hdb { my $string = shift; my @array = sort { substr( $string, $a ) cmp substr( $string, $b ) + } 0..length($string)-1; $_++ for @array; return @array; } sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
and output:
20000 26 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -11% -49% RobertCraven 3.32/s 12% -- -43% kennethk 5.80/s 96% 75% -- 20000 4 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -9% -51% RobertCraven 3.25/s 10% -- -46% kennethk 6.08/s 105% 87% -- 200000 26 5 s/iter hdb kennethk hdb 47.0 -- -95% kennethk 2.13 2107% -- 200000 4 5 s/iter hdb kennethk hdb 45.2 -- -95% kennethk 2.05 2104% --

Update: Somehow, some bugs had slipped through; a testament to the power of combinatorics. Updated all of the above, which reduces the impressiveness of the original result, but still yields advantages. Metrics are fairly volatile.


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.


In reply to Re: suffix array efficiency by kennethk
in thread suffix array efficiency by RobertCraven

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.