in reply to Re: Counting keys with defined or undefined elements in a hash
in thread Counting keys with defined or undefined elements in a hash

My aesthetic sense is somewhat offended by scanning the list twice with grep. I would rather scan the list once and count both the defined and undefined elements in a single pass. Not to mention the fact that grep constructs a list that you then throw away after having counted its elements.

The code below uses the result of defined to index an array. The array contains two elements, the number of undefined and the number of defined elements.

    #! /usr/bin/perl -w
    use strict;

    my %h = (
        a => 1,
        b => undef,
        c => 1,
        d => 1,
        e => undef,
    );

    my @def;
    $def[ defined $_ ? 1 : 0 ]++ for values %h;
    print "undefined=$def[0], defined=$def[1]\n";

    __END__

produces:

    undefined=2, defined=3

(A short while later)...

Note that in both broquaint's code and mine, you are still iterating over the hash. There really is no way around that. You could, as the AM says, keep track of what you insert as you go along, but that approach offers you far more ways to introduce bugs and would be far messier in terms of code. Or, if you are brave and insist on this approach, you could try some fancy footwork with tie.
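To give an idea of what that tie footwork might look like, here is a sketch (the CountingHash class name and counts method are my own invention, not anything from the thread): the tied hash updates the two counters on every store and delete, so reading them back is O(1), at the price of slowing down every write.

```perl
#! /usr/bin/perl -w
use strict;

package CountingHash;

# Keep running counts of defined and undefined values as they are stored.
sub TIEHASH {
    my $class = shift;
    bless { data => {}, def => 0, undef => 0 }, $class;
}

sub STORE {
    my ($self, $key, $value) = @_;
    # Retire the count for any value being overwritten.
    if (exists $self->{data}{$key}) {
        defined $self->{data}{$key} ? $self->{def}-- : $self->{undef}--;
    }
    defined $value ? $self->{def}++ : $self->{undef}++;
    $self->{data}{$key} = $value;
}

sub DELETE {
    my ($self, $key) = @_;
    return unless exists $self->{data}{$key};
    defined $self->{data}{$key} ? $self->{def}-- : $self->{undef}--;
    delete $self->{data}{$key};
}

sub FETCH    { $_[0]{data}{$_[1]} }
sub EXISTS   { exists $_[0]{data}{$_[1]} }
sub FIRSTKEY { my $d = $_[0]{data}; my $n = scalar keys %$d; each %$d }
sub NEXTKEY  { each %{ $_[0]{data} } }

sub CLEAR {
    my $self = shift;
    %{ $self->{data} } = ();
    $self->{def} = $self->{undef} = 0;
}

# Returns (undefined count, defined count) in constant time.
sub counts { ( $_[0]{undef}, $_[0]{def} ) }

package main;

my %h;
my $counter = tie %h, 'CountingHash';
%h = ( a => 1, b => undef, c => 1, d => 1, e => undef );
my ($undef, $def) = $counter->counts;
print "undefined=$undef, defined=$def\n";   # undefined=2, defined=3
```

Note that the list assignment to %h goes through CLEAR and then one STORE per pair, so the counters stay honest. Whether the per-write overhead is worth it depends on how often you need the counts.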

Just iterate and be done with it.

...(some time later)...

Just for kicks, I rolled a benchmark out of the two approaches:

except I screwed up

    #! /usr/bin/perl -w
    use strict;

    use Benchmark qw/cmpthese/;

    my %h;
    my $nr = shift || 10000000;
    $h{$nr} = ((rand() > 0.2) ? 1 : undef) while $nr--;

    sub by_for {
        my $h = shift;
        my @def;
        $def[ defined $_ ? 1 : 0 ]++ for values %$h;
        @def;
    }

    sub by_grep {
        my $h = shift;
        (
            scalar( grep !defined, values %$h ),
            scalar( grep  defined, values %$h ),
        );
    }

    printf "by_for:  undefined=%d, defined=%d\n", by_for( \%h );
    printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );

    cmpthese( -5, {
        for  => ' by_for( \%h ) ',
        grep => ' by_grep( \%h ) ',
    } );

Be careful! As it stands, this code creates a hash with 10 million elements. On my machine it soaks up 800MB of RAM. The results show that the scalar grep, somewhat to my surprise, consistently outperforms the for approach:

    by_for:  undefined=2000300, defined=7999700
    by_grep: undefined=2000300, defined=7999700
    Benchmark: running for, grep for at least 5 CPU seconds...
           for:  5 wallclock secs ( 5.27 usr +  0.00 sys =  5.27 CPU) @ 163521.04/s (n=862318)
          grep:  6 wallclock secs ( 5.05 usr +  0.00 sys =  5.05 CPU) @ 224222.12/s (n=1131621)
             Rate   for  grep
    for  163521/s    --  -27%
    grep 224222/s   37%    --

Still, iterating over a 10_000_000 element list more than 160 thousand times a second isn't too shabby a performance...

...(and then quite some time later)...

Pondering the results of my benchmark, I realised something was wrong. Comparing how much memory was involved with how many iterations were performed implied a memory bandwidth tending towards the exabyte-to-zettabyte-per-second range... which meant that the benchmark wasn't testing what I thought it was testing. Indeed, jsprat arrived at the same conclusion independently.

Here, then, is a corrected benchmark that produces far more reasonable results. The problem was that I passed code strings to cmpthese: those strings are eval'd inside the Benchmark module, outside the lexical scope of my %h, so the benchmarked code was operating on a different, empty hash. Passing closures instead captures the real %h. I just changed the cmpthese call as follows:

    cmpthese( -300, {
        byfor  => sub { by_for( \%h ) },
        bygrep => sub { by_grep( \%h ) },
    } );

(Yes, I had to up this to 300 seconds of CPU time to get good results)... which gives:

    by_for:  undefined=2398233, defined=9601767
    by_grep: undefined=2398233, defined=9601767
    Benchmark: running byfor, bygrep for at least 300 CPU seconds...
         byfor: 311 wallclock secs (311.09 usr +  0.03 sys = 311.12 CPU) @  0.06/s (n=20)
        bygrep: 313 wallclock secs (312.95 usr +  0.04 sys = 312.99 CPU) @  0.04/s (n=13)
           s/iter bygrep  byfor
    bygrep   24.1     --   -35%
    byfor    15.6    55%     --

This is running on a 12 million element hash, which consumes 900MB on my machine. (I had forgotten that I compiled the kernel to stop processes from consuming more than a gig, otherwise I would have pushed the size of the hash up further.)

I should also note that this was run on 5.8.0, so I'm getting the benefit of the free scalar grep.

At this size the for approach has started to pull away from grep. Which approach you should use therefore depends on how big the hashes are :)

...(and yet still later, after I read some more in the thread)... I added the difference algorithm (and I'm kicking myself for not having thought of it -- one more reason why this site is such a great place), which pulls way ahead: count the defined values with a single grep, then subtract that from the total number of keys to get the undefined count. The changes are:

    sub by_diff {
        my $h = shift;
        my $defined = grep defined, values %$h;
        (
            scalar keys(%$h) - $defined,
            $defined,
        );
    }

    printf "by_for:  undefined=%d, defined=%d\n", by_for( \%h );
    printf "by_grep: undefined=%d, defined=%d\n", by_grep( \%h );
    printf "by_diff: undefined=%d, defined=%d\n", by_diff( \%h );

    cmpthese( -600, {
        byfor  => sub { by_for( \%h ) },
        bygrep => sub { by_grep( \%h ) },
        bydiff => sub { by_diff( \%h ) },
    } );

Which gives:

    by_for:  undefined=2398740, defined=9601260
    by_grep: undefined=2398740, defined=9601260
    by_diff: undefined=2398740, defined=9601260
    Benchmark: running bydiff, byfor, bygrep for at least 600 CPU seconds...
        bydiff: 606 wallclock secs (604.93 usr +  0.11 sys = 605.04 CPU) @  0.08/s (n=51)
         byfor: 625 wallclock secs (623.32 usr +  0.09 sys = 623.41 CPU) @  0.06/s (n=40)
        bygrep: 604 wallclock secs (602.22 usr +  0.07 sys = 602.29 CPU) @  0.04/s (n=25)
           s/iter bygrep  byfor bydiff
    bygrep   24.1     --   -35%   -51%
    byfor    15.6    55%     --   -24%
    bydiff   11.9   103%    31%     --

So there you have it. At this late stage of the game, if performance is still a worry, you should start thinking about writing in C, or using a database.

Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.