in reply to suffix array efficiency

Executive summary: 20x speed-up for 200k bp

hdb's solution fixes the the presumed Out of memory! error you were seeing (it kicks in on my desktop somewhere around 50k), but dragging around the whole suffix when you don't need nearly that much string to solve your question is a little silly. Rather than that, you could try grabbing say 10 characters at a time, and then only grabbing more if you need it.

sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
Second, doing 2 calls to substr on each cmp is excessive. You'll run out of memory if you grab the whole suffix (O(n^2) memory usage), but if you cache the bit you need, that should drop to O(m*n). (Found a bug that made caching unreliable; fixing it ruined its performance)

Finally, I note the phrase "bp" in the post, and note that the demo set contains 2 characters; my guess is that this is a bioinformatics question, and checking w/ 4 letters would, if anything, hurt my code. I've therefore also benchmarked w/ rand(4) instead of rand(26). Long story short: 2107% speed-up over hdb's solution. Test script:

use strict; use warnings; use Benchmark ':all'; verify(); comparison(20000,26,20); comparison(20000,4,20); comparison(200000,26,5); comparison(200000,4,5); sub verify { my $input = join '', map { ('a'..'z')[rand(4)] } 0..20000-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); die "Mismatch 1" if @RobertCraven != @hdb; die "Mismatch 2" if @RobertCraven != @kennethk; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($i)" if $RobertCraven[$i] ne $hdb[$i]; die "Char mismatch 2($i)" if $RobertCraven[$i] ne $kennethk[$i +]; } } sub comparison { my ($n, $letters, $count) = @_; printf "%10d %2d %2d\n", @_; my $input = join '', map { ('a'..'z')[rand($letters)] } 0..$n-1; cmpthese($count, { $n < 25000 ? ( 'RobertCraven'=> sub{RobertCraven($input)}, ) : (), 'hdb' => sub{hdb($input)}, 'kennethk' => sub{kennethk($input)}, }); print "\n"; } sub RobertCraven { my $string = shift; my $length = length($string); my @ordered_list = sort map { substr($string, $_) } 0 .. $length - + 1; my @indices = map $length - length($_) + 1, @ordered_list; return @indices; } sub hdb { my $string = shift; my @array = sort { substr( $string, $a ) cmp substr( $string, $b ) + } 0..length($string)-1; $_++ for @array; return @array; } sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
and output:
20000 26 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -11% -49% RobertCraven 3.32/s 12% -- -43% kennethk 5.80/s 96% 75% -- 20000 4 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -9% -51% RobertCraven 3.25/s 10% -- -46% kennethk 6.08/s 105% 87% -- 200000 26 5 s/iter hdb kennethk hdb 47.0 -- -95% kennethk 2.13 2107% -- 200000 4 5 s/iter hdb kennethk hdb 45.2 -- -95% kennethk 2.05 2104% --

Update: Somehow, some bugs had slipped through; a testament to the power of combinatorics. Updated all of the above, which reduces the impressiveness of the original result, but still yields advantages. Metrics are fairly volatile.


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Replies are listed 'Best First'.
Re^2: suffix array efficiency
by hdb (Monsignor) on Jan 20, 2014 at 17:40 UTC

    Fascinating! But should it not be $b+$off in

    my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp substr($string, $b, +$step)){ $off += $step; } } 0 .. length($string) - 1;

      I have no idea how that happened. Corrected the node, re-ran the script, found a bug in the caching code, updated results, etc. Thanks.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: suffix array efficiency
by oiskuu (Hermit) on Jan 20, 2014 at 18:44 UTC

    Please include comparison(20000,1,10); in your bench.
    Your caching strategy blows up in the degenerate case where $string eq "aaaa"...

    I'd suggest something simpler:

    sub xxx { my $string = shift; my @cache = unpack q(x*X3 .@0/(NX3)), $string."\0\0\0"; my @indices = sort {$cache[$a] <=> $cache[$b] || substr($string, $a) cmp substr($string, $b) } 0 .. $#cache; $_++ for @indices; return @indices; }
    Probably only works in C locale. Correctness unverified.

    Update: Oops, copy-pasta fix.

      As mentioned in the parent, I found a bug in the caching code. I'm not worried about the degenerate case, though obviously my approach incurs a penalty in worst case scenario. I was going to benchmark your code, but couldn't get it to pass my updated validation routine:
      sub verify { for my $letters (1, 4, 26) { my $input = join '', map { ('a'..'z')[rand($letters)] } 0..200 +00-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); my @xxx = xxx($input); die "Mismatch 1 ($letters)" if @RobertCraven != @hdb; die "Mismatch 2 ($letters)" if @RobertCraven != @kennethk; die "Mismatch 3 ($letters)" if @RobertCraven != @xxx; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($letters, $i)" if $RobertCraven[$i] n +e $hdb[$i]; die "Char mismatch 2($letters, $i)" if $RobertCraven[$i] n +e $kennethk[$i]; die "Char mismatch 3($letters, $i)" if $RobertCraven[$i] n +e $xxx[$i]; } } }
      Does it pass on your machine?

      Update: Parental code now passes tests.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Benchmark results:
      20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% --
      Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization.

      And, FYI, the windowing approach is about 30x slower than other options with only 1 letter.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        The degenerate (1-letter) case:

        20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
        The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

        Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.

Re^2: suffix array efficiency
by hdb (Monsignor) on Jan 21, 2014 at 09:27 UTC

    The following does away with substr alltogether and uses your until loop on the array of characters of the string. This is probably the closest to a C implementation. It is still around a factor 3 slower than your cached version...

    sub suffix3 { my @input = split //, $_[0]; my $off = 0; my @array = sort { $off=0; until( ($input[$a+$off]//'') cmp ($inpu +t[$b+$off]//'') ) { $off++ } } 0..@input-1; $_++ for @array; return @array; }