in reply to Re: How good is gzip data as digest?
in thread How good is gzip data as digest?

I know about tie-ing a disk-based hash, but I kept quiet about that option in my post because my scenario needs to be fast(!) and I can't afford the usual disk lookup times - so the hash needs to stay in memory and has to be as small as possible...
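
Just to be clear about what I'm ruling out, this is the usual tied, disk-backed setup (a minimal sketch using DB_File; the file name and key are only placeholders):

use Fcntl;
use DB_File;

# Every read or write on %cache can hit the disk, which is exactly
# the latency I can't afford here.
tie my %cache, 'DB_File', 'cache.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "tie failed: $!";

my $digest = 'example-key';                      # placeholder key
$cache{$digest} = 1;                             # write goes through to the DB file
print "seen\n" if exists $cache{$digest};        # lookup may touch the disk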

Re^3: How good is gzip data as digest?
by Fletch (Bishop) on Apr 02, 2009 at 18:19 UTC

    Ooh, another constraint; that helps. :) If false positives aren't an issue, perhaps a Bloom filter would do (see also Bloom::Filter)?
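
    Something like this, maybe (an untested sketch; the capacity and error_rate numbers are made up, so check Bloom::Filter's docs for the exact constructor arguments):

    use Bloom::Filter;

    # Sized for roughly a million keys at a 0.1% false-positive rate
    # (both numbers are illustrative, not a recommendation).
    my $filter = Bloom::Filter->new( capacity => 1_000_000, error_rate => 0.001 );

    my $key = 'example-key';                      # placeholder key
    $filter->add( $key );                         # remember it
    print "probably seen\n" if $filter->check( $key );   # may be a false positive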

    But zipping purely random data this short probably isn't going to be worthwhile (I get the compressed version coming out at around 115% of the input size; the more regularity there is within individual keys, the better it will do, so take that as a worst-case upper bound).

    use IO::Compress::Gzip qw(gzip);

    use constant NUM_REPS => 100;

    my $num_reps = shift || NUM_REPS();

    print "Trying ", $num_reps, " random strings\n";

    my %lengths;
    my ( $out, $avg ) = ( '', 0 );

    for ( 1 .. $num_reps ) {

        # Build a random binary string of 101-201 bytes.
        my $in = join( "", map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );

        gzip \$in => \$out;

        my ( $li, $lo ) = map length, ( $in, $out );
        $lengths{$lo}++;

        my $pct = $lo / $li * 100.0;
        $avg += $pct;
        printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
    }

    printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

    if ( $ENV{SHOW_HIST} ) {
        print "Length distribution\n";
        for my $len ( sort { $a <=> $b } keys %lengths ) {
            print "$len\t$lengths{ $len }\n";
        }
    }

    exit 0;

    __END__
    $ perl comptest 5000
    Trying 5000 random strings
    avg size after compression: 115.864%
    $ SHOW=1 perl comptest 5
    Trying 5 random strings
    104     =>      127     122.115%
    173     =>      196     113.295%
    171     =>      194     113.450%
    145     =>      168     115.862%
    190     =>      213     112.105%
    avg size after compression: 115.366%

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re^3: How good is gzip data as digest?
by roboticus (Chancellor) on Apr 02, 2009 at 20:18 UTC
    isync:

    Okay, then you could always amortize the disk lookup by using a fragment of the digest value as the in-memory hash key to keep the size small; then you'd only need to go to disk when you have a collision. It's yet another level of crunching, but it might save you enough RAM and speed to get the performance you need.
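
    Something along these lines, perhaps (a rough sketch, assuming Digest::MD5 for the digest and DB_File for the disk side; the fragment size and file name are arbitrary):

    use Digest::MD5 qw(md5);
    use Fcntl;
    use DB_File;

    # Disk side: full digests live in a tied DBM file.
    tie my %on_disk, 'DB_File', 'digests.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie failed: $!";

    # Memory side: only a short fragment of each digest.
    my %seen_frag;

    sub record_and_check {
        my ( $key )  = @_;
        my $digest   = md5( $key );
        my $fragment = substr( $digest, 0, 4 );    # keep 4 of the 16 bytes

        if ( $seen_frag{$fragment} && exists $on_disk{$digest} ) {
            return 0;                              # seen before (disk confirmed it)
        }
        $seen_frag{$fragment} = 1;                 # cheap in-memory record
        $on_disk{$digest}     = 1;                 # full record goes to disk
        return 1;                                  # new key
    }

    That way the disk only gets read when two keys happen to share a fragment.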

    HOWEVER: Have you actually measured the performance? It would be a pity to waste all this time thinking about it if a disk-based hash is, in fact, fast enough to serve the purpose.
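
    If you want a quick number, something like this would give a rough comparison (a sketch using Benchmark and DB_File; the key count and file name are arbitrary):

    use Benchmark qw(cmpthese);
    use Fcntl;
    use DB_File;

    my %in_memory;
    tie my %on_disk, 'DB_File', 'bench.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie failed: $!";

    # Load both hashes with the same keys (10_000 is arbitrary).
    $in_memory{$_} = $on_disk{$_} = 1 for 1 .. 10_000;

    # Compare random lookups against the plain hash and the tied one.
    cmpthese( -3, {
        memory => sub { my $hit = exists $in_memory{ 1 + int rand 10_000 } },
        tied   => sub { my $hit = exists $on_disk{ 1 + int rand 10_000 } },
    } );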

    Remember: First make it work, then make it fast...

    roboticus
      Actually, I did measure performance, and the difference is significant. But I like your idea of hash fractions - it will go into the next iteration of my lookup-hash algorithm for a test drive.