amarquis has asked for the wisdom of the Perl Monks concerning the following question:

I need a random element from a huge hash, one large enough that using keys to get a list of elements cripples my system. (Why didn't I use an array? I didn't plan to have to do this, and most of what I'm doing with this thing involves pulling single records by name.)

I looked at the answers at Pulling random elements from a hash, but they involve the keys method too. I'm thinking what I'm going to do, since this only needs to run once, is iterate through once using each to find the total number of elements, generate unique random numbers in that range, and then iterate through again, grabbing those records as I go.
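
Roughly, the plan looks like this (the data and the number of picks below are placeholders):

use strict;
use warnings;

my %records = map { "name$_" => "record$_" } 1 .. 1000;   # placeholder data
my $want    = 5;                                          # how many records I need

# pass 1: count the elements with each(), never building a key list
my $count = 0;
while ( my ($k) = each %records ) { $count++ }

# generate $want unique random positions in 0 .. $count - 1
my %wanted;
$wanted{ int rand $count } = 1 while keys %wanted < $want;

# pass 2: grab the records sitting at those positions
# (each() reset its iterator when pass 1 ran off the end)
my $i = 0;
my @picked;
while ( my ($k, $v) = each %records ) {
    push @picked, [ $k, $v ] if $wanted{ $i++ };
}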

This seems gross, but I backed myself into the corner of having to do it. I was wondering, though, if some of the minds here had a cleaner solution.

Side question: perldoc states that, for a given run of perl and an unmodified hash, each, keys, and values are guaranteed to have the same ordering as each other. What about repeated iterations of the same type? For example, if I use keys, process the results without changing the hash, and then use keys again, do I get the same ordering? My testing says yes, but is it guaranteed?

Re: Random hash element (low memory edition).
by Roy Johnson (Monsignor) on Jan 24, 2008 at 19:29 UTC
    You can find the number of elements by using keys in scalar context, and it won't take the time or space of actually going through the elements.
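
    For instance (a throwaway demo hash, not from the original post):

    my %hash  = map { $_ => 1 } 'a' .. 'e';
    my $count = keys %hash;   # scalar context: just the count, no key list is built
    print "$count\n";         # prints 5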

    But you could use the recipe for printing a random line from a file to iterate through the hash just once, without knowing how many elements it has, and select a random one.

    Yes, as long as you don't change the hash, keys will return keys in the same order every time. values will return values in the corresponding order.


    Caution: Contents may have been coded under pressure.

      That method is awfully neat, but my Perl skills are not, so I'll hit you with a follow-up question.

      rand($.) < 1 && ($line = $_) while <>;

      So this iterates through the file and, if a random number from 0 to the current line number is less than 1, selects the current line, overwriting previous choices? So, in my case, I'd just keep a count variable for how many elements I've been through and use that in place of $.?

      I'm going to have to check out the proof of that; it is really interesting that everything evens out (i.e., the first elements have a high chance of being overwritten, the later elements have a high chance of just not being picked, and it all balances out to a uniformly random selection). Thanks for pointing me to that FAQ entry!
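
      Update: the proof telescopes nicely, as far as I can tell: element k is picked at step k with probability 1/k and survives each later step j with probability (j-1)/j, so its overall chance is (1/k) * (k/(k+1)) * ... * ((n-1)/n) = 1/n for every k. A quick simulation (my own sanity check, not from the FAQ) agrees:

      use strict;
      use warnings;

      # run the rand($n) < 1 selection 100_000 times over ten elements
      # and confirm each element is picked about 10% of the time
      my %picks;
      for ( 1 .. 100_000 ) {
          my ( $chosen, $n ) = ( undef, 0 );
          for my $elem ( 'a' .. 'j' ) {
              rand( ++$n ) < 1 and $chosen = $elem;
          }
          $picks{$chosen}++;
      }
      printf "%s: %.3f\n", $_, $picks{$_} / 100_000 for sort keys %picks;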

        So, in my case, I'd just keep a count variable for how many elements I've been through and use that in place of $.?

        Exactly.


        Caution: Contents may have been coded under pressure.
Re: Random hash element (low memory edition).
by BrowserUk (Patriarch) on Jan 24, 2008 at 20:06 UTC

    You could use something like this:

    Code updated per ikegami's post below.

    sub getRandomPairFromHash {
        my $href  = shift;
        my $nKeys = keys %$href;           # scalar context: count only, no key list
        my $n     = 1 + int rand $nKeys;   # step count in 1 .. $nKeys
        my ( $key, $val );
        ( $key, $val ) = each %$href while $n--;
        return ( $key, $val );
    }

    It does work and doesn't increase the memory consumption, but fast it's not.
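
    For instance, a minimal usage sketch (the demo hash here is made up):

    my %bighash = map { "key$_" => "value$_" } 1 .. 100_000;
    my ( $key, $value ) = getRandomPairFromHash( \%bighash );
    print "$key => $value\n";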


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That fails when $n == 0 and when $n == 1. One way to fix it is to apply both of the following changes:

      Change
      while --$n
      to
      while $n--

      Change
      my $n = int rand $nKeys;
      to
      my $n = 1 + int rand $nKeys;

      Update: Might as well post the modified code:

      sub getRandomPairFromHash {
          my $href  = shift;
          my $nKeys = keys %$href
              or return ( undef, undef );
          my $n = 1 + int rand $nKeys;
          my ( $key, $val );
          ( $key, $val ) = each %$href while $n--;
          return ( $key, $val );
      }
Re: Random hash element (low memory edition).
by dave_the_m (Monsignor) on Jan 24, 2008 at 22:52 UTC
    The order in which keys are returned by keys/each/values is supposed to be random (not cryptographically random, but explicitly coded as random), with the random factor being different for each run (see PERL_HASH_SEED).

    So why not just grab the first element of the hash? Or, if you need more than one, grab the first N? Or, if you need a single random element from time to time, either maintain a counter and grab element $n++ using the methods discussed elsewhere in this thread, or, if you can guarantee that the hash's iterator won't be used elsewhere, use 'each' to get the next element, as in the sketch below.
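
    A minimal sketch of the 'each' route (next_pair is just my name for it, and it assumes nothing else touches the hash's iterator):

    sub next_pair {
        my $href = shift;
        my ( $k, $v ) = each %$href;
        # each() returns an empty list (and resets itself) once per
        # full pass, so call it again to wrap around to the start
        ( $k, $v ) = each %$href if !defined $k;
        return ( $k, $v );
    }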

    Dave.

      The order in which keys are returned by keys/each/values is supposed to be random (not cryptographically random, but explicitly coded as random), with the random factor being different for each run (see PERL_HASH_SEED).

      If they are, then my perl is quite buggy.

      I ran perl -MData::Dumper -we 'print Dumper \%ENV' a few times, both with perl 5.8.8 and 5.10.0 (on Debian Etch i386), and I always got the same order.

      (And this is no Data::Dumper artifact - it works with join ", ", keys %ENV as well.)

      It is true that you shouldn't rely on a particular hash order, but it doesn't mean a randomized order is guaranteed.

        My mistake. The first cut of the hash randomisation code always randomised, but then it was changed to only randomise in pathological cases, to avoid too much code breakage.

        Dave.

      The order in which keys are returned by keys/each/values is supposed to be random (not cryptographically random, but explicitly coded as random), with the random factor being different for each run (see PERL_HASH_SEED).
      Uh, it may be different each time you run the script, but if you iterate through the same hash twice in your script, you'll get the same order twice.
Re: Random hash element (low memory edition).
by bart (Canon) on Jan 27, 2008 at 12:49 UTC
    You can adapt the code from perlfaq5's (and, as it notes, the Camel book's) How do I select a random line from a file? to iterate through your hash once, instead of through the lines of a file.
    my $chosen;
    my $n = 0;
    while ( my ($k, $v) = each %hash ) {
        rand(++$n) < 1 and $chosen = $v;   # keep the Nth pair with probability 1/N
    }
    return $chosen;
        The advantage? I'm not sure there is one. It is a different approach from the one you proposed in Re: Random hash element (low memory edition)., and it is not worse. That is interesting in itself.

        It could even be beneficial in the case of a tied hash, as a call to keys makes it loop through all the keys of the tied hash just to get the count. So, in that case, your code will loop through them at least once to get the count, plus a fraction of a cycle (up to a complete cycle) more than that. I assume my approach could be faster there.

        But note that the Big O is still the same for both approaches: O(n), where n is the number of hash keys.

        It might become more interesting if one wants a weighted probability: then, in my code, you "just" have to replace the test condition with a different function of rand() and a weight that depends on the current key. But the structure of the code can remain the same. That's a huge advantage.

        I have derived a usable function in a journal entry on use.perl.org.

        With weighted probabilities (the weight is a positive number that is proportional to the desired odds that this item is picked; if it's zero, the item is skipped), the code can be:

        my ( $threshold, $value );
        while ( my ($k, $v) = each %hash ) {
            if ( my $weight = weight($k) ) {   # function call
                my $rand = -log(1 - rand) / $weight;
                if ( not defined $threshold or $rand < $threshold ) {
                    $threshold = $rand;
                    $value     = $v;
                }
            }
        }
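
        The -log(1 - rand)/$weight expression draws an exponential random variate with rate $weight, and keeping the minimum means each item wins with probability proportional to its weight. As a usage sketch, weight() can be any mapping from a key to a positive number; this stand-in (purely for illustration, not from the journal entry) favours longer keys:

        sub weight { length $_[0] }   # hypothetical weight: odds proportional to key length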