RobertCraven has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone,
I have question regarding efficiency of suffix arrays in Perl.

If string = abaaba
suffix_array = (6, 3, 4, 1, 5, 2)
corresponding to ('a', 'aaba', 'aba', 'abaaba', 'ba', 'baaba')

A brutal approach is to create the list of suffixes
('abaaba', 'baaba', 'aaba', 'aba', 'ba', 'a')
then order it
('a', 'aaba', 'aba', 'abaaba', 'ba', 'baaba')
then extract the positions
(6, 3, 4, 1, 5, 2)

Easily done in perl with:
@ordered_list = sort map { substr($string, $_) } 0 .. $length - 1;

foreach my $elt (@ordered_list) {
print $length - length($elt) . "\n";
}

If string is very long (working with 200.000bp here), the sort does not cope

Any suggestions?

Replies are listed 'Best First'.
Re: suffix array efficiency
by hdb (Monsignor) on Jan 20, 2014 at 12:07 UTC

    Instead of creating the suffixes, you can directly sort the indices based on a call to substr in sort. My machine does this in less than 2 minutes for a random string of length 200,000.

    use strict; use warnings; sub suffix { my $string = shift; my @array = sort { substr( $string, $a ) cmp substr( $string, $b ) + } 0..length($string)-1; $_++ for @array; return @array; } my $n = 200000; my $input = join '', map { ('a'..'z')[rand(26)] } 0..$n-1; my @result = suffix $input; print "@result\n";
      Thanks, works well!
Re: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 17:00 UTC
    Executive summary: 20x speed-up for 200k bp

    hdb's solution fixes the the presumed Out of memory! error you were seeing (it kicks in on my desktop somewhere around 50k), but dragging around the whole suffix when you don't need nearly that much string to solve your question is a little silly. Rather than that, you could try grabbing say 10 characters at a time, and then only grabbing more if you need it.

    sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
    Second, doing 2 calls to substr on each cmp is excessive. You'll run out of memory if you grab the whole suffix (O(n^2) memory usage), but if you cache the bit you need, that should drop to O(m*n). (Found a bug that made caching unreliable; fixing it ruined its performance)

    Finally, I note the phrase "bp" in the post, and note that the demo set contains 2 characters; my guess is that this is a bioinformatics question, and checking w/ 4 letters would, if anything, hurt my code. I've therefore also benchmarked w/ rand(4) instead of rand(26). Long story short: 2107% speed-up over hdb's solution. Test script:

    and output:

    Update: Somehow, some bugs had slipped through; a testament to the power of combinatorics. Updated all of the above, which reduces the impressiveness of the original result, but still yields advantages. Metrics are fairly volatile.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Fascinating! But should it not be $b+$off in

      my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp substr($string, $b, +$step)){ $off += $step; } } 0 .. length($string) - 1;

        I have no idea how that happened. Corrected the node, re-ran the script, found a bug in the caching code, updated results, etc. Thanks.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Please include comparison(20000,1,10); in your bench.
      Your caching strategy blows up in the degenerate case where $string eq "aaaa"...

      I'd suggest something simpler:

      sub xxx { my $string = shift; my @cache = unpack q(x*X3 .@0/(NX3)), $string."\0\0\0"; my @indices = sort {$cache[$a] <=> $cache[$b] || substr($string, $a) cmp substr($string, $b) } 0 .. $#cache; $_++ for @indices; return @indices; }
      Probably only works in C locale. Correctness unverified.

      Update: Oops, copy-pasta fix.

        As mentioned in the parent, I found a bug in the caching code. I'm not worried about the degenerate case, though obviously my approach incurs a penalty in worst case scenario. I was going to benchmark your code, but couldn't get it to pass my updated validation routine:
        sub verify { for my $letters (1, 4, 26) { my $input = join '', map { ('a'..'z')[rand($letters)] } 0..200 +00-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); my @xxx = xxx($input); die "Mismatch 1 ($letters)" if @RobertCraven != @hdb; die "Mismatch 2 ($letters)" if @RobertCraven != @kennethk; die "Mismatch 3 ($letters)" if @RobertCraven != @xxx; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($letters, $i)" if $RobertCraven[$i] n +e $hdb[$i]; die "Char mismatch 2($letters, $i)" if $RobertCraven[$i] n +e $kennethk[$i]; die "Char mismatch 3($letters, $i)" if $RobertCraven[$i] n +e $xxx[$i]; } } }
        Does it pass on your machine?

        Update: Parental code now passes tests.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Benchmark results:
        20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% --
        Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization.

        And, FYI, the windowing approach is about 30x slower than other options with only 1 letter.


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      The following does away with substr alltogether and uses your until loop on the array of characters of the string. This is probably the closest to a C implementation. It is still around a factor 3 slower than your cached version...

      sub suffix3 { my @input = split //, $_[0]; my $off = 0; my @array = sort { $off=0; until( ($input[$a+$off]//'') cmp ($inpu +t[$b+$off]//'') ) { $off++ } } 0..@input-1; $_++ for @array; return @array; }
Re: suffix array efficiency
by BrowserUk (Patriarch) on Jan 20, 2014 at 20:05 UTC

    May I ask how you are using the suffix array you are constructing? (Perhaps a short example?)

    I ask because, as you've discovered, constructing suffix arrays in Perl -- as opposed to C or other languages where you can manipulate pointers to avoid large scale memory duplication -- is an expensive operation, and in the past I've found that then using them also carries high costs.

    Generally, I've found that there is usually a different approach to the underlying problem that works out substantially more efficient by avoiding the construction of the suffix array.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: suffix array efficiency
by johngg (Canon) on Jan 20, 2014 at 22:57 UTC
    Adapting code from this thread gives this ST sort.

    use strict; use warnings; use 5.010; my @words = qw{ hypersensitivities zoology chlorofluorocarbons reinterpretations mississipi }; foreach my $word ( @words ) { say qq{\n** $word **}; printf qq{%2d - %s\n}, $_, substr $word, $_ - 1 for sortPosns( $word ); } sub sortPosns { return map { $_->[ 0 ] } sort { $a->[ 1 ] cmp $b->[ 1 ] } map { [ $_, substr $_[ 0 ], $_ - 1 ] } 1 .. length $_[ 0 ]; }

    The output.

    Weirdly satisfying throwing odd words at it! I've no idea how it might perform in the benchmarks but I thought I'd post in case it is of interest.

    Cheers,

    JohnGG

      Popped it in the benchmark, just for fun. Suffers from Out of memory! at large strings (like RobertCraven's original), but fastest of the full string comparisons:
      20000 26 20 Rate hdb RobertCraven sortPosns kennethk + xxx hdb 2.73/s -- -15% -41% -55% + -84% RobertCraven 3.20/s 17% -- -31% -48% + -81% sortPosns 4.61/s 69% 44% -- -24% + -73% kennethk 6.11/s 124% 91% 32% -- + -64% xxx 17.1/s 527% 435% 271% 180% + -- 20000 4 20 Rate hdb RobertCraven sortPosns xxx + kennethk hdb 2.95/s -- -10% -36% -43% + -52% RobertCraven 3.29/s 11% -- -29% -36% + -46% sortPosns 4.63/s 57% 41% -- -10% + -25% xxx 5.15/s 74% 57% 11% -- + -16% kennethk 6.13/s 108% 87% 33% 19% + --

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.