lupey has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm trying to find an efficient way to predeclare a hash with a set of keys like 5,6,8-53,62-106 (see below). My set of keys are basically members of a set where the value is not important. For example, it is only important that exists($set{67}) is true.

I'm having trouble finding appropriate documentation on this issue. Does anybody have advice on my problem?

Thank you.

Paul

$set{5}++; $set{6}++; $set{8}++; $set{9}++; $set{10}++; $set{11}++; $set{12}++; ... ...

Replies are listed 'Best First'.
Re: How do I efficiently predeclare a hash with a set of keys
by jdporter (Paladin) on May 18, 2007 at 19:05 UTC

    The most efficient way to initialize a hash (other than directly, i.e. %h = ( key => value, ...)) is to assign hash slices:

    @set{5,6,8..53,62..106} = ();

    (Note that this makes the values all undef.)

    This technique works particularly well when using hashes as sets, as you are.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re: How do I efficiently predeclare a hash with a set of keys
by BrowserUk (Patriarch) on May 18, 2007 at 19:19 UTC

    This is concise and saves a little space by not allocating any to the values.* The keys however still exist and test as such with exists.

    my %set; undef @set{ 5,6,8..53,62..106 };

    Update: This used to make a difference at some point in the past, but it no longer does (circa. 5.8.8 and possibly before).

    The null list assignment method others have described my %set; @set{ ... } = (); is equal in memory usage and actually slightly faster. Thanks to bart++ for bringing this to my attention.

    If you haven't voted on this node yet, please downvote it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      How does this compare, for speed and memory use, to
      @set{ 5,6,8..53,62..106 } = ();
      ? I'd expect them to be comparable.

        Good call. At some point in the past, the undef version used less memory and was faster than the null list assignment version. If I remember correctly, it was tilly that pointed this out to me.

        However, I just tried it with 5.8.8 and that is no longer the case. They both use the same amount of memory, and the null list assignment is marginally (~4%) faster. It's still faster and uses less memory than the typical method.

        my %set = map{ $_ => 1 } 1, 2, 5..8, 100..200;

        Another case of How do I know what I 'know' is right?. :( I'll update my node above to reflect this.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you. This is exactly what I wanted :)
Re: How do I efficiently predeclare a hash with a set of keys
by TGI (Parson) on May 18, 2007 at 22:12 UTC

    I was going to suggest that depending on the sparseness of your keys, and whether they are all positive integers, it might be a faster to use exists to test an array. Before I made that assertion, I thought I should write a benchmark.

    The results were that the theoretical optimization is pretty useless. The speeds were basically identical.

    So, unless someone cares to tell me how badly I screwed up my benchmark, (proper benchmarking is hard, and I don't write them often) I'd have to say "Don't bother with my silly array idea."

    Update: BrowserUK was kind enough to demonstrate that my benchmarks were flawed and that the arrays are indeed about 37% faster.

    use strict; use warnings; use Benchmark; use List::Util qw(shuffle); my $count=5000000; my @existing_indexes = ( 5,6,8..53,62..106); my @indexes = (0..200); my @shuffled_indexes = shuffle @indexes; my %test_hash; my @test_array; @test_array[ @existing_indexes ] = (); @test_hash{ @existing_indexes } = (); Benchmark::cmpthese($count, { arrayLookup => q{ my $val = &arrayLookup( \@test_array, \@indexes ); }, hashLookup => q{ my $val = &hashLookup( \%test_hash, \@indexes ); }, shuffled_arrayLookup => q{ my $val = &arrayLookup( \@test_array, \@shuffled_indexes ); }, shuffled_hashLookup => q{ my $val = &hashLookup( \%test_hash, \@shuffled_indexes ) }, } ); sub arrayLookup { my $array = shift; my $indexes = shift; my @results = map { exists $array->[$_]; } @$indexes; return \@results; } sub hashLookup { my $hash = shift; my $keys = shift; my @results = map { exists $hash->{$_}; } @$keys; return \@results; }

    And the results

    Run 1: Rate shuffled_arrayLookup hashLookup shuffled +_hashLookup arrayLookup shuffled_arrayLookup 409199/s -- -1% + -1% -2% hashLookup 411286/s 1% -- + -1% -2% shuffled_hashLookup 415076/s 1% 1% + -- -1% arrayLookup 417781/s 2% 2% + 1% -- Run 2: shuffled_arrayLookup arrayLookup shuffled_hashLookup hashLookup 412371/s -- -0% +-1% -1% shuffled_arrayLookup 413976/s 0% -- +-0% -1% arrayLookup 414491/s 1% 0% + -- -1% shuffled_hashLookup 416667/s 1% 1% + 1% -- C:\Documents and Settings\Mark\Desktop\Downloads> exists.pl Rate shuffled_arrayLookup hashLookup shuffled +_hashLookup arrayLookup shuffled_arrayLookup 409199/s -- -1% + -1% -2% hashLookup 411286/s 1% -- + -1% -2% shuffled_hashLookup 415076/s 1% 1% + -- -1% arrayLookup 417781/s 2% 2% + 1% -- Run 3: Rate hashLookup shuffled_arrayLookup arrayLoo +kup shuffled_hashLookup hashLookup 412371/s -- -0% +-1% -1% shuffled_arrayLookup 413976/s 0% -- +-0% -1% arrayLookup 414491/s 1% 0% + -- -1% shuffled_hashLookup 416667/s 1% 1% + 1% --


    TGI says moo

      But the main problem with your benchmark is that by pushing the code under test inside a subroutine, the mechanics of calling the subroutine, unpacking the parameters, and allocating memory to accumulate the results; all combine to entirely swamp the actual thing you are trying to test.

      Adding a direct test of each of the two possibilities to your benchmark dramatically shows up the cost of all that indirection, as well as confirming your original suspicion that arrays are 37% faster than hashes:

      use strict; use warnings; use Benchmark; use List::Util qw(shuffle); my $count=5000000; my @existing_indexes = ( 5,6,8..53,62..106); our @indexes = (0..200); our @shuffled_indexes = shuffle @indexes; our %test_hash; our @test_array; @test_array[ @existing_indexes ] = (); @test_hash{ @existing_indexes } = (); Benchmark::cmpthese( -1, { AviaSub => q{ my $val = &arrayLookup( \@test_array, \@indexes ); }, HviaSub => q{ my $val = &hashLookup( \%test_hash, \@indexes ); }, AviaSubS => q{ my $val = &arrayLookup( \@test_array, \@shuffled_indexes ); }, HviaSubS => q{ my $val = &hashLookup( \%test_hash, \@shuffled_indexes ) }, Adirect => q{ my $count = 0; exists $test_array[ $_ ] and ++$count for @indexes; }, Hdirect => q{ my $count = 0; exists $test_hash{ $_ } and ++$count for @indexes; }, } ); sub arrayLookup { my $array = shift; my $indexes = shift; my @results = map { exists $array->[$_]; } @$indexes; return \@results; } sub hashLookup { my $hash = shift; my $keys = shift; my @results = map { exists $hash->{$_}; } @$keys; return \@results; }

      The results:

      c:\test>junk3 Rate HviaSub HviaSubS AviaSub AviaSubS Hdirect Adirect HviaSub 2085/s -- -1% -2% -5% -85% -89% HviaSubS 2113/s 1% -- -1% -4% -85% -89% AviaSub 2131/s 2% 1% -- -3% -85% -89% AviaSubS 2203/s 6% 4% 3% -- -84% -89% Hdirect 14147/s 578% 569% 564% 542% -- -27% Adirect 19366/s 829% 816% 809% 779% 37% --

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The first problem is that your arrays are lexical, but you are using strings not sub for your tests. That means that they cannot not be captured as closures. So, you are effectively testing empty arrays and hashes which is why the iteration counts are so high. To test this, insert a print statement inside the first test and run the test for a count of 1:

      ... Benchmark::cmpthese( 1, { arrayLookup => q{ print scalar( @test_array ), ' : ', scalar( @indexes ); my $val = &arrayLookup( \@test_array, \@indexes ); }, ... __OUPUT__ c:\test>junk3 0 : 0 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate shuffled_arrayLookup shuffled_ +hashLookup hashLookup arrayLookup shuffled_arrayLookup 1000000000000000/s -- + 0% 0% 0% shuffled_hashLookup 1000000000000000/s 0% + -- 0% 0% hashLookup 1000000000000000/s 0% + 0% -- 0% arrayLookup 1000000000000000/s 0% + 0% 0% --

      If you are going to use strings, then your external variables have to be declared with our not my. Doing that and re-running the above sanity check gives:

      c:\test>junk3 107 : 201 (warning: too few iterations for a reliable count +) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate shuffled_arrayLookup shuffled_ +hashLookup hashLookup arrayLookup shuffled_arrayLookup 1000000000000000/s -- + 0% 0% 0% shuffled_hashLookup 1000000000000000/s 0% + -- 0% 0% hashLookup 1000000000000000/s 0% + 0% -- 0% arrayLookup 1000000000000000/s 0% + 0% 0% --

      Running the benchmark once that change is made gives:

      c:\test>junk3 Rate shuffled_hashLookup hashLookup arrayLookup + shuffled_arrayLookup shuffled_hashLookup 2113/s -- -0% -3% + -5% hashLookup 2113/s 0% -- -3% + -5% arrayLookup 2181/s 3% 3% -- + -2% shuffled_arrayLookup 2233/s 6% 6% 2% + --

      Not a whole heap of difference in the percentages, but take a close look at the two orders of magnitude drop in the number of iterations/second.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How do I efficiently predeclare a hash with a set of keys
by Fletch (Bishop) on May 18, 2007 at 19:34 UTC

    You may be interested in Set::IntSpan from CPAN which will read your sample data format ("5,6,8-53,62-106") directly.

Re: How do I efficiently predeclare a hash with a set of keys
by FunkyMonk (Bishop) on May 18, 2007 at 21:42 UTC
    my %hash = map { $_ => 1 } 5, 6, 8 .. 53, 62 .. 106;

    TIMTOWTDI