in reply to How do I efficiently predeclare a hash with a set of keys

I was going to suggest that, depending on the sparseness of your keys and whether they are all positive integers, it might be faster to use exists to test an array. Before I made that assertion, I thought I should write a benchmark.
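(For clarity, here is a minimal sketch of the idea I had in mind, with illustrative names, not part of the benchmark below: mark the wanted keys in an array with the same slice trick you'd use to predeclare a hash, then test membership with exists. Unassigned indexes remain "holes", so exists returns false for them.)

use strict;
use warnings;

my @keys = ( 5, 6, 8 .. 53, 62 .. 106 );

# Hash as a set: predeclare the keys with undef values.
my %key_set;
@key_set{ @keys } = ();
print exists $key_set{42} ? "42 is a key\n" : "42 is not a key\n";

# Array as a set: the same slice assignment, usable when the keys are
# smallish non-negative integers. Index 60 was never assigned, so it
# stays a hole and exists returns false for it.
my @key_set_array;
@key_set_array[ @keys ] = ();
print exists $key_set_array[42] ? "42 is a key\n" : "42 is not a key\n";
print exists $key_set_array[60] ? "60 is a key\n" : "60 is not a key\n";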

The results were that the theoretical optimization is pretty useless. The speeds were basically identical.

So, unless someone cares to tell me how badly I screwed up my benchmark (proper benchmarking is hard, and I don't write benchmarks often), I'd have to say: "Don't bother with my silly array idea."

Update: BrowserUk was kind enough to demonstrate that my benchmarks were flawed and that arrays are indeed about 37% faster.

use strict;
use warnings;
use Benchmark;
use List::Util qw(shuffle);

my $count = 5000000;

my @existing_indexes = ( 5, 6, 8 .. 53, 62 .. 106 );
my @indexes          = ( 0 .. 200 );
my @shuffled_indexes = shuffle @indexes;

my %test_hash;
my @test_array;
@test_array[ @existing_indexes ] = ();
@test_hash{ @existing_indexes }  = ();

Benchmark::cmpthese( $count, {
    arrayLookup          => q{ my $val = &arrayLookup( \@test_array, \@indexes ); },
    hashLookup           => q{ my $val = &hashLookup( \%test_hash, \@indexes ); },
    shuffled_arrayLookup => q{ my $val = &arrayLookup( \@test_array, \@shuffled_indexes ); },
    shuffled_hashLookup  => q{ my $val = &hashLookup( \%test_hash, \@shuffled_indexes ) },
} );

sub arrayLookup {
    my $array   = shift;
    my $indexes = shift;
    my @results = map { exists $array->[$_]; } @$indexes;
    return \@results;
}

sub hashLookup {
    my $hash = shift;
    my $keys = shift;
    my @results = map { exists $hash->{$_}; } @$keys;
    return \@results;
}

And the results:

Run 1:
                          Rate shuffled_arrayLookup hashLookup shuffled_hashLookup arrayLookup
shuffled_arrayLookup  409199/s                   --        -1%                 -1%         -2%
hashLookup            411286/s                   1%         --                 -1%         -2%
shuffled_hashLookup   415076/s                   1%         1%                  --         -1%
arrayLookup           417781/s                   2%         2%                  1%          --

Run 2:
                          Rate hashLookup shuffled_arrayLookup arrayLookup shuffled_hashLookup
hashLookup            412371/s         --                  -0%         -1%                 -1%
shuffled_arrayLookup  413976/s         0%                   --         -0%                 -1%
arrayLookup           414491/s         1%                   0%          --                 -1%
shuffled_hashLookup   416667/s         1%                   1%          1%                  --

Run 3:
                          Rate hashLookup shuffled_arrayLookup arrayLookup shuffled_hashLookup
hashLookup            412371/s         --                  -0%         -1%                 -1%
shuffled_arrayLookup  413976/s         0%                   --         -0%                 -1%
arrayLookup           414491/s         1%                   0%          --                 -1%
shuffled_hashLookup   416667/s         1%                   1%          1%                  --


TGI says moo

Re^2: How do I efficiently predeclare a hash with a set of keys
by BrowserUk (Patriarch) on May 18, 2007 at 23:29 UTC

    But the main problem with your benchmark is that, by pushing the code under test inside a subroutine, the mechanics of calling the subroutine, unpacking the parameters, and allocating memory to accumulate the results all combine to entirely swamp the actual thing you are trying to test.

    Adding a direct test of each of the two possibilities to your benchmark dramatically shows up the cost of all that indirection, as well as confirming your original suspicion that arrays are 37% faster than hashes:

    use strict;
    use warnings;
    use Benchmark;
    use List::Util qw(shuffle);

    my $count = 5000000;

    my @existing_indexes = ( 5, 6, 8 .. 53, 62 .. 106 );
    our @indexes          = ( 0 .. 200 );
    our @shuffled_indexes = shuffle @indexes;

    our %test_hash;
    our @test_array;
    @test_array[ @existing_indexes ] = ();
    @test_hash{ @existing_indexes }  = ();

    Benchmark::cmpthese( -1, {
        AviaSub  => q{ my $val = &arrayLookup( \@test_array, \@indexes ); },
        HviaSub  => q{ my $val = &hashLookup( \%test_hash, \@indexes ); },
        AviaSubS => q{ my $val = &arrayLookup( \@test_array, \@shuffled_indexes ); },
        HviaSubS => q{ my $val = &hashLookup( \%test_hash, \@shuffled_indexes ) },
        Adirect  => q{ my $count = 0; exists $test_array[ $_ ] and ++$count for @indexes; },
        Hdirect  => q{ my $count = 0; exists $test_hash{ $_ } and ++$count for @indexes; },
    } );

    sub arrayLookup {
        my $array   = shift;
        my $indexes = shift;
        my @results = map { exists $array->[$_]; } @$indexes;
        return \@results;
    }

    sub hashLookup {
        my $hash = shift;
        my $keys = shift;
        my @results = map { exists $hash->{$_}; } @$keys;
        return \@results;
    }

    The results:

    c:\test>junk3
                 Rate HviaSub HviaSubS AviaSub AviaSubS Hdirect Adirect
    HviaSub    2085/s      --      -1%     -2%      -5%    -85%    -89%
    HviaSubS   2113/s      1%       --     -1%      -4%    -85%    -89%
    AviaSub    2131/s      2%       1%      --      -3%    -85%    -89%
    AviaSubS   2203/s      6%       4%      3%       --    -84%    -89%
    Hdirect   14147/s    578%     569%    564%     542%      --    -27%
    Adirect   19366/s    829%     816%    809%     779%     37%      --

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: How do I efficiently predeclare a hash with a set of keys
by BrowserUk (Patriarch) on May 18, 2007 at 23:17 UTC

    The first problem is that your arrays are lexical, but you are using strings rather than subs for your tests. That means they cannot be captured as closures. So you are effectively testing empty arrays and hashes, which is why the iteration counts are so high. To test this, insert a print statement inside the first test and run the test for a count of 1:

    ...
    Benchmark::cmpthese( 1, {
        arrayLookup => q{
            print scalar( @test_array ), ' : ', scalar( @indexes );
            my $val = &arrayLookup( \@test_array, \@indexes );
        },
    ...

    __OUTPUT__
    c:\test>junk3
    0 : 0
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
                                        Rate shuffled_arrayLookup shuffled_hashLookup hashLookup arrayLookup
    shuffled_arrayLookup  1000000000000000/s                   --                  0%         0%          0%
    shuffled_hashLookup   1000000000000000/s                   0%                  --         0%          0%
    hashLookup            1000000000000000/s                   0%                  0%         --          0%
    arrayLookup           1000000000000000/s                   0%                  0%         0%          --

    If you are going to use strings, then your external variables have to be declared with our, not my. Doing that and re-running the above sanity check gives:

    c:\test>junk3
    107 : 201
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
                                        Rate shuffled_arrayLookup shuffled_hashLookup hashLookup arrayLookup
    shuffled_arrayLookup  1000000000000000/s                   --                  0%         0%          0%
    shuffled_hashLookup   1000000000000000/s                   0%                  --         0%          0%
    hashLookup            1000000000000000/s                   0%                  0%         --          0%
    arrayLookup           1000000000000000/s                   0%                  0%         0%          --

    Running the benchmark once that change is made gives:

    c:\test>junk3
                              Rate shuffled_hashLookup hashLookup arrayLookup shuffled_arrayLookup
    shuffled_hashLookup     2113/s                  --        -0%         -3%                  -5%
    hashLookup              2113/s                  0%         --         -3%                  -5%
    arrayLookup             2181/s                  3%         3%          --                  -2%
    shuffled_arrayLookup    2233/s                  6%         6%          2%                   --

    Not a whole heap of difference in the percentages, but take a close look at the two orders of magnitude drop in the number of iterations/second.
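    (As an aside, and not part of BrowserUk's reply: the reason the our declarations matter is that a string handed to Benchmark is compiled later, inside Benchmark.pm, outside the lexical scope of the calling file, so it cannot see variables declared there with my and instead picks up the like-named, empty package variables. A minimal sketch of that scoping rule, with illustrative names:)

    use strict;
    use warnings;

    # Stand-in for Benchmark's internal string eval: this sub is compiled
    # before the variables below are declared, so code eval'd here cannot
    # see them as lexicals; unqualified names resolve to package variables.
    sub run_string_test {
        no strict 'vars';      # strictures relaxed, as they are for string tests
        no warnings 'once';
        return eval shift;
    }

    my  @lexical = ( 1 .. 10 );   # what the original benchmark declared (my)
    our @package = ( 1 .. 10 );   # what the corrected benchmark declares (our)

    print run_string_test(q{ scalar @lexical }), " elements seen via my\n";   # prints 0
    print run_string_test(q{ scalar @package }), " elements seen via our\n";  # prints 10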


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.