in reply to Re^2: suffix array efficiency
in thread suffix array efficiency

Benchmark results:
20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% --
Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization.

And, FYI, the windowing approach is about 30x slower than other options with only 1 letter.


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Replies are listed 'Best First'.
Re^4: suffix array efficiency
by oiskuu (Hermit) on Jan 20, 2014 at 19:50 UTC

    The degenerate (1-letter) case:

    20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
    The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

    Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.

      I'm not a bioinformatics guy, but if massive runs are actually a significant concern in this system, it's pretty easy to fix the worst-case behavior by adaptively boosting the window, so that worst case regresses to hdb's solution:
      sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; $step *= 2; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
      which gives a degenerate-case comparison of
      RobertCraven 2.90/s kennethk 12.8/s xxx 13.1/s hdb 14.2/s
      and comparable behavior in the rest of the benchmarks.

      I also note your computer is twice as fast as mine. Wanna trade?


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.