in reply to suffix array efficiency
hdb's solution fixes the the presumed Out of memory! error you were seeing (it kicks in on my desktop somewhere around 50k), but dragging around the whole suffix when you don't need nearly that much string to solve your question is a little silly. Rather than that, you could try grabbing say 10 characters at a time, and then only grabbing more if you need it.
sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
Finally, I note the phrase "bp" in the post, and note that the demo set contains 2 characters; my guess is that this is a bioinformatics question, and checking w/ 4 letters would, if anything, hurt my code. I've therefore also benchmarked w/ rand(4) instead of rand(26). Long story short: 2107% speed-up over hdb's solution. Test script:
use strict; use warnings; use Benchmark ':all'; verify(); comparison(20000,26,20); comparison(20000,4,20); comparison(200000,26,5); comparison(200000,4,5); sub verify { my $input = join '', map { ('a'..'z')[rand(4)] } 0..20000-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); die "Mismatch 1" if @RobertCraven != @hdb; die "Mismatch 2" if @RobertCraven != @kennethk; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($i)" if $RobertCraven[$i] ne $hdb[$i]; die "Char mismatch 2($i)" if $RobertCraven[$i] ne $kennethk[$i +]; } } sub comparison { my ($n, $letters, $count) = @_; printf "%10d %2d %2d\n", @_; my $input = join '', map { ('a'..'z')[rand($letters)] } 0..$n-1; cmpthese($count, { $n < 25000 ? ( 'RobertCraven'=> sub{RobertCraven($input)}, ) : (), 'hdb' => sub{hdb($input)}, 'kennethk' => sub{kennethk($input)}, }); print "\n"; } sub RobertCraven { my $string = shift; my $length = length($string); my @ordered_list = sort map { substr($string, $_) } 0 .. $length - + 1; my @indices = map $length - length($_) + 1, @ordered_list; return @indices; } sub hdb { my $string = shift; my @array = sort { substr( $string, $a ) cmp substr( $string, $b ) + } 0..length($string)-1; $_++ for @array; return @array; } sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
20000 26 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -11% -49% RobertCraven 3.32/s 12% -- -43% kennethk 5.80/s 96% 75% -- 20000 4 20 Rate hdb RobertCraven kennethk hdb 2.96/s -- -9% -51% RobertCraven 3.25/s 10% -- -46% kennethk 6.08/s 105% 87% -- 200000 26 5 s/iter hdb kennethk hdb 47.0 -- -95% kennethk 2.13 2107% -- 200000 4 5 s/iter hdb kennethk hdb 45.2 -- -95% kennethk 2.05 2104% --
Update: Somehow, some bugs had slipped through; a testament to the power of combinatorics. Updated all of the above, which reduces the impressiveness of the original result, but still yields advantages. Metrics are fairly volatile.
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: suffix array efficiency
by hdb (Monsignor) on Jan 20, 2014 at 17:40 UTC | |
by kennethk (Abbot) on Jan 20, 2014 at 18:20 UTC | |
|
Re^2: suffix array efficiency
by oiskuu (Hermit) on Jan 20, 2014 at 18:44 UTC | |
by kennethk (Abbot) on Jan 20, 2014 at 19:09 UTC | |
by kennethk (Abbot) on Jan 20, 2014 at 19:23 UTC | |
by oiskuu (Hermit) on Jan 20, 2014 at 19:50 UTC | |
by kennethk (Abbot) on Jan 20, 2014 at 20:16 UTC | |
|
Re^2: suffix array efficiency
by hdb (Monsignor) on Jan 21, 2014 at 09:27 UTC |