in reply to Sorting a hash of arrays based on a particular array element.

I read the A Fresh Look at Efficient Perl Sorting paper last week and have been itching to play with packed default sorting so here we go, packed default verses the example code BrowserUK gave above. Sorting two million random records 10 times.

=>cat /tmp/sorttest #!/usr/opt/perl5/bin/perl use strict; use warnings; use Benchmark; srand(42); # the answer my $elements = 2000000; my %hash; for (1..$elements) { my ($key, $one, $two) = rand =~/(\d\.\d{2})(\d{2})(\d{4})/; $hash{$key} = [$one, $two]; } sub packed_default { my @sorted_keys = map substr($_,1), sort map {pack('C', $hash{$_}[0])."$_"} keys %hash; return @sorted_keys; } sub BroweserUK { my @sorted_keys = sort{ $hash{$a}[0] <=> $hash{$b}[0] } keys %hash +; } timethese(10, { 'packed_default' => \&packed_default, 'BroweserUK' => \&BroweserUK, }); __end__ Benchmark: timing 10 iterations of BroweserUK, packed_default... BroweserUK: 0 wallclock secs (0.07 usr + 0.00 sys = 0.07 CPU) @ 142.86/s (n=10) (warning: too few iterations for a reliable count) packed_default: 0 wallclock secs (0.04 usr + 0.00 sys = 0.04 CPU) @ 250.00/s (n=10) (warning: too few iterations for a reliable count)

As they suggest in the paper coercing perl to use the default sort instead of custom sub is almost twice as fast.

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!

Replies are listed 'Best First'.
Re^2: Sorting a hash of arrays based on a particular array element.
by salva (Canon) on Jul 27, 2005 at 18:54 UTC
    let me extend your benchmark with an entry for Sort::Key:
    ... use Sort::Key qw(ikeysort); sub SortKey { my @sorted_keys = ikeysort { $hash{$_}[0] } keys %hash; } cmpthese(10, { BrowserUK => \&BrowserUK, packed_default => \&packed_default, SortKey => \&SortKey }); __end__ Rate BrowserUK packed_default SortKey BrowserUK 50.0/s -- -30% -45% packed_default 71.4/s 43% -- -21% SortKey 90.9/s 82% 27% --
    BTW, in my computer the difference between packed and BrowserUK method is not so big as on your test.

      Wow, most impressive. How do you do the sorting under the covers, can you point me to some online resources ? As it is so much faster than perl's sort will it ever make it into core and replace the current sort or does it gain speed by having different interfaces for sorting different data types ?

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        well, actually Sort::Key uses the internal perl sort function under the hood. It just exposes a different API that can cover most sorting usages without incurring in an extra cost as core sort does.

        The algorithm used is similar to...

        sub sortkey { my ($func, @v); my @keys = map { &$func } @v; my @ix = sort { $keys[$a] cmp $keys[$b] } 0..$#keys; return @v[@ix] }
        but being implemented in C (specially, the comparing callbacks) it doesn't require the nasty hacks of the GRT to do the sort part efficiently.

      Oh silly silly me ... 2 million values thrown into a hash with only 100 possible keys !!!

      # for (1..$elements) { # my ($key, $one, $two) = rand =~/(\d\.\d{2})(\d{2})(\d{4})/; # $hash{$key} = [$one, $two]; #} for (1..$elements) { my ($one, $two) = rand =~/\d+.(\d{6})(\d{4})/; $hash{$_} = [$one, $two]; } # Also reduced $elements to 200_000 and 2 million # exceeded per proc memeory limit. # I also use more entropy in the field to sort on # requiring a change of packed_default to ... sub packed_default { my @sorted_keys = map substr($_,4), sort map {pack('N', $hash{$_}[0])."$_"} keys %hash; return @sorted_keys; } Benchmark: timing 10 iterations of BroweserUK, packed_default... BroweserUK: 188 wallclock secs (175.63 usr + 0.13 sys = 175.76 CPU) @ 0.06/s (n=10) packed_default: 72 wallclock secs (64.61 usr + 0.16 sys = 64.77 CPU) @ 0.15/s (n=10)

      Sadly I am on a company machine with a restrictive policy about installing modules so I can not test Sort::Key here, will give it a bash at home this evening.

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        on my computer (for 1M elements):
        s/iter BrowserUK packed_default SortKey BrowserUK 71.7 -- -66% -79% packed_default 24.5 193% -- -38% SortKey 15.2 372% 61% --
        another important point for sorting algorithms is their memory comsumption, and the GRT offers the worst behaviour for that aspect:
        use Memchmark qw(cmpthese); ... cmpthese(BrowserUK => \&BrowserUK, packed_default => \&packed_default, SortKey => \&SortKey ); __end__ test: BrowserUK, memory used: 102006784 bytes test: SortKey, memory used: 85598208 bytes test: packed_default, memory used: 182788096 bytes
        (Memchmark results are approximate, but accurate enough to show the difference)