suffix array efficiency

RobertCraven has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: suffix array efficiency by hdb (Monsignor) on Jan 20, 2014 at 12:07 UTC
Instead of creating the suffixes, you can directly sort the indices based on a call to substr in sort. My machine does this in less than 2 minutes for a random string of length 200,000. `use strict; use warnings; sub suffix { my $string = shift; my @array = sort { substr( $string, $a ) cmp substr( $string, $b ) + } 0..length($string)-1; $_++ for @array; return @array; } my $n = 200000; my $input = join '', map { ('a'..'z')[rand(26)] } 0..$n-1; my @result = suffix $input; print "@result\n";` [download]	[reply] [d/l]
Re^2: suffix array efficiency by RobertCraven (Sexton) on Jan 20, 2014 at 13:31 UTC
Thanks, works well!	[reply]
Re: suffix array efficiency by kennethk (Abbot) on Jan 20, 2014 at 17:00 UTC
Executive summary: 20x speed-up for 200k bp hdb's solution fixes the the presumed `Out of memory!` error you were seeing (it kicks in on my desktop somewhere around 50k), but dragging around the whole suffix when you don't need nearly that much string to solve your question is a little silly. Rather than that, you could try grabbing say 10 characters at a time, and then only grabbing more if you need it. `sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }` [download] ~~Second, doing 2 calls to substr on each cmp is excessive. You'll run out of memory if you grab the whole suffix (O(n^2) memory usage), but if you cache the bit you need, that should drop to O(mn).~~ (Found a bug that made caching unreliable; fixing it ruined its performance) Finally, I note the phrase "bp" in the post, and note that the demo set contains 2 characters; my guess is that this is a bioinformatics question, and checking w/ 4 letters would, if anything, hurt my code. I've therefore also benchmarked w/ `rand(4)` instead of `rand(26)`. Long story short: 2107% speed-up over hdb's solution. Test script: Read more... (3 kB) and output: Read more... (950 Bytes) Update:* Somehow, some bugs had slipped through; a testament to the power of combinatorics. Updated all of the above, which reduces the impressiveness of the original result, but still yields advantages. Metrics are fairly volatile. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply] [d/l] [select]
Re^2: suffix array efficiency by hdb (Monsignor) on Jan 20, 2014 at 17:40 UTC
Fascinating! But should it not be `$b+$off` in `my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp substr($string, $b, +$step)){ $off += $step; } } 0 .. length($string) - 1;` [download]	[reply] [d/l] [select]
Re^3: suffix array efficiency by kennethk (Abbot) on Jan 20, 2014 at 18:20 UTC
I have no idea how that happened. Corrected the node, re-ran the script, found a bug in the caching code, updated results, etc. Thanks. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply]
Re^2: suffix array efficiency by oiskuu (Hermit) on Jan 20, 2014 at 18:44 UTC
Please include `comparison(20000,1,10);` in your bench. Your caching strategy blows up in the degenerate case where `$string eq "aaaa"...` I'd suggest something simpler: `sub xxx { my $string = shift; my @cache = unpack q(xX3 .@0/(NX3)), $string."\0\0\0"; my @indices = sort {$cache[$a] <=> $cache[$b] \|\| substr($string, $a) cmp substr($string, $b) } 0 .. $#cache; $_++ for @indices; return @indices; }` [download] Probably only works in C locale. Correctness unverified. Update:* Oops, copy-pasta fix.	[reply] [d/l] [select]
Re^3: suffix array efficiency by kennethk (Abbot) on Jan 20, 2014 at 19:09 UTC
As mentioned in the parent, I found a bug in the caching code. I'm not worried about the degenerate case, though obviously my approach incurs a penalty in worst case scenario. I was going to benchmark your code, but couldn't get it to pass my updated validation routine: sub verify { for my $letters (1, 4, 26) { my $input = join '', map { ('a'..'z')[rand($letters)] } 0..200 +00-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); my @xxx = xxx($input); die "Mismatch 1 ($letters)" if @RobertCraven != @hdb; die "Mismatch 2 ($letters)" if @RobertCraven != @kennethk; die "Mismatch 3 ($letters)" if @RobertCraven != @xxx; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($letters, $i)" if $RobertCraven[$i] n +e $hdb[$i]; die "Char mismatch 2($letters, $i)" if $RobertCraven[$i] n +e $kennethk[$i]; die "Char mismatch 3($letters, $i)" if $RobertCraven[$i] n +e $xxx[$i]; } } } [download] Does it pass on your machine? Update: Parental code now passes tests. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply] [d/l]
Re^3: suffix array efficiency by kennethk (Abbot) on Jan 20, 2014 at 19:23 UTC
Benchmark results: 20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% -- [download] Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization. And, FYI, the windowing approach is about 30x slower than other options with only 1 letter. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply] [d/l]
Re^4: suffix array efficiency by oiskuu (Hermit) on Jan 20, 2014 at 19:50 UTC
Re^5: suffix array efficiency by kennethk (Abbot) on Jan 20, 2014 at 20:16 UTC
Re^2: suffix array efficiency by hdb (Monsignor) on Jan 21, 2014 at 09:27 UTC
The following does away with substr alltogether and uses your until loop on the array of characters of the string. This is probably the closest to a C implementation. It is still around a factor 3 slower than your cached version... `sub suffix3 { my @input = split //, $_[0]; my $off = 0; my @array = sort { $off=0; until( ($input[$a+$off]//'') cmp ($inpu +t[$b+$off]//'') ) { $off++ } } 0..@input-1; $_++ for @array; return @array; }` [download]	[reply] [d/l]
Re: suffix array efficiency by BrowserUk (Patriarch) on Jan 20, 2014 at 20:05 UTC
May I ask how you are using the suffix array you are constructing? (Perhaps a short example?) I ask because, as you've discovered, constructing suffix arrays in Perl -- as opposed to C or other languages where you can manipulate pointers to avoid large scale memory duplication -- is an expensive operation, and in the past I've found that then using them also carries high costs. Generally, I've found that there is usually a different approach to the underlying problem that works out substantially more efficient by avoiding the construction of the suffix array. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re: suffix array efficiency by johngg (Canon) on Jan 20, 2014 at 22:57 UTC
Adapting code from this thread gives this ST sort. `use strict; use warnings; use 5.010; my @words = qw{ hypersensitivities zoology chlorofluorocarbons reinterpretations mississipi }; foreach my $word ( @words ) { say qq{\n $word }; printf qq{%2d - %s\n}, $_, substr $word, $_ - 1 for sortPosns( $word ); } sub sortPosns { return map { $_->[ 0 ] } sort { $a->[ 1 ] cmp $b->[ 1 ] } map { [ $_, substr $_[ 0 ], $_ - 1 ] } 1 .. length $_[ 0 ]; }` [download] The output. Read more... (1336 Bytes) Weirdly satisfying throwing odd words at it! I've no idea how it might perform in the benchmarks but I thought I'd post in case it is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: suffix array efficiency by kennethk (Abbot) on Jan 21, 2014 at 15:20 UTC
Popped it in the benchmark, just for fun. Suffers from `Out of memory!` at large strings (like RobertCraven's original), but fastest of the full string comparisons: 20000 26 20 Rate hdb RobertCraven sortPosns kennethk + xxx hdb 2.73/s -- -15% -41% -55% + -84% RobertCraven 3.20/s 17% -- -31% -48% + -81% sortPosns 4.61/s 69% 44% -- -24% + -73% kennethk 6.11/s 124% 91% 32% -- + -64% xxx 17.1/s 527% 435% 271% 180% + -- 20000 4 20 Rate hdb RobertCraven sortPosns xxx + kennethk hdb 2.95/s -- -10% -36% -43% + -52% RobertCraven 3.29/s 11% -- -29% -36% + -46% sortPosns 4.63/s 57% 41% -- -10% + -25% xxx 5.15/s 74% 57% 11% -- + -16% kennethk 6.13/s 108% 87% 33% 19% + -- [download] #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply] [d/l] [select]