in reply to Re: How good is gzip data as digest?
in thread How good is gzip data as digest?

I know about tie-ing a disk-based hash, but I kept quiet about that option in my post because my scenario needs to be fast(!) and I can't afford the usual disk lookup times - so the hash needs to stay in memory and has to be as small as possible...
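
Just to be clear about what I'm ruling out, this is the usual tied, disk-backed setup (a minimal sketch using DB_File; the file name and key are only placeholders):

use Fcntl;
use DB_File;

# Every read or write on %cache can hit the disk, which is exactly
# the latency I can't afford here.
tie my %cache, 'DB_File', 'cache.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "tie failed: $!";

my $digest = 'example-key';                      # placeholder key
$cache{$digest} = 1;                             # write goes through to the DB file
print "seen\n" if exists $cache{$digest};        # lookup may touch the disk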

Re^3: How good is gzip data as digest?
by Fletch (Bishop) on Apr 02, 2009 at 18:19 UTC

    Ooh, another constraint; that helps. :) If false positives aren't an issue, perhaps a Bloom filter would do (see also Bloom::Filter)?
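
    Something like this, maybe (an untested sketch; the capacity and error_rate numbers are made up, so check Bloom::Filter's docs for the exact constructor arguments):

    use Bloom::Filter;

    # Sized for roughly a million keys at a 0.1% false-positive rate
    # (both numbers are illustrative, not a recommendation).
    my $filter = Bloom::Filter->new( capacity => 1_000_000, error_rate => 0.001 );

    my $key = 'example-key';                      # placeholder key
    $filter->add( $key );                         # remember it
    print "probably seen\n" if $filter->check( $key );   # may be a false positive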

    But zipping purely random data this short probably isn't going to be worthwhile (I get the compressed version coming out at around 115% of the input size; the more regularity there is within individual keys, the better it will do, so take that as a worst-case upper bound).

    use IO::Compress::Gzip qw(gzip);

    use constant NUM_REPS => 100;

    my $num_reps = shift || NUM_REPS();

    print "Trying ", $num_reps, " random strings\n";

    my %lengths;
    my ( $out, $avg ) = ( '', 0 );

    for ( 1 .. $num_reps ) {

        # Build a random binary string of 101-201 bytes.
        my $in = join( "", map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );

        gzip \$in => \$out;

        my ( $li, $lo ) = map length, ( $in, $out );
        $lengths{$lo}++;

        my $pct = $lo / $li * 100.0;
        $avg += $pct;
        printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
    }

    printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

    if ( $ENV{SHOW_HIST} ) {
        print "Length distribution\n";
        for my $len ( sort { $a <=> $b } keys %lengths ) {
            print "$len\t$lengths{ $len }\n";
        }
    }

    exit 0;

    __END__
    $ perl comptest 5000
    Trying 5000 random strings
    avg size after compression: 115.864%
    $ SHOW=1 perl comptest 5
    Trying 5 random strings
    104     =>      127     122.115%
    173     =>      196     113.295%
    171     =>      194     113.450%
    145     =>      168     115.862%
    190     =>      213     112.105%
    avg size after compression: 115.366%

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re^3: How good is gzip data as digest?
by roboticus (Chancellor) on Apr 02, 2009 at 20:18 UTC
    isync:

    Okay, then you could always amortize the disk lookup by using a fragment of the digest value as the in-memory hash key to keep the size small; then you'd only need to go to disk when you have a collision. It's yet another level of crunching, but it might save you enough RAM and speed to get the performance you need.
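
    Something along these lines, perhaps (a rough sketch, assuming Digest::MD5 for the digest and DB_File for the disk side; the fragment size and file name are arbitrary):

    use Digest::MD5 qw(md5);
    use Fcntl;
    use DB_File;

    # Disk side: full digests live in a tied DBM file.
    tie my %on_disk, 'DB_File', 'digests.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie failed: $!";

    # Memory side: only a short fragment of each digest.
    my %seen_frag;

    sub record_and_check {
        my ( $key )  = @_;
        my $digest   = md5( $key );
        my $fragment = substr( $digest, 0, 4 );    # keep 4 of the 16 bytes

        if ( $seen_frag{$fragment} && exists $on_disk{$digest} ) {
            return 0;                              # seen before (disk confirmed it)
        }
        $seen_frag{$fragment} = 1;                 # cheap in-memory record
        $on_disk{$digest}     = 1;                 # full record goes to disk
        return 1;                                  # new key
    }

    That way the disk only gets read when two keys happen to share a fragment.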

    HOWEVER: Have you actually measured the performance? It would be a pity to waste all this time thinking about it if a disk-based hash is, in fact, fast enough to serve the purpose.
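
    If you want a quick number, something like this would give a rough comparison (a sketch using Benchmark and DB_File; the key count and file name are arbitrary):

    use Benchmark qw(cmpthese);
    use Fcntl;
    use DB_File;

    my %in_memory;
    tie my %on_disk, 'DB_File', 'bench.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "tie failed: $!";

    # Load both hashes with the same keys (10_000 is arbitrary).
    $in_memory{$_} = $on_disk{$_} = 1 for 1 .. 10_000;

    # Compare random lookups against the plain hash and the tied one.
    cmpthese( -3, {
        memory => sub { my $hit = exists $in_memory{ 1 + int rand 10_000 } },
        tied   => sub { my $hit = exists $on_disk{ 1 + int rand 10_000 } },
    } );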

    Remember: First make it work, then make it fast...

    roboticus
      Actually, I did measure performance, and the difference is significant. But I like your idea of hash fractions - it will go into the next iteration of my lookup-hash algorithm for a test drive.