in reply to Re^2: How good is gzip data as digest?
in thread How good is gzip data as digest?
Ooh, another constraint; that helps. :) If false positives aren't an issue, perhaps a Bloom filter would do (see also Bloom::Filter)? A quick sketch follows.
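Something along these lines (a rough sketch based on the Bloom::Filter synopsis; the capacity and error_rate values are arbitrary examples, so check them against the module's docs for your key count):

use Bloom::Filter;

# sketch only: capacity and error_rate are made-up example values
my $bf = Bloom::Filter->new( capacity => 1_000, error_rate => 0.001 );

my @seen = qw( foo bar baz );
$bf->add( @seen );

for my $key ( qw( foo quux ) ) {
    # check() can report a false positive, but never a false negative
    print "$key: ",
        ( $bf->check( $key ) ? "probably seen" : "definitely new" ), "\n";
}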
But zipping purely random data this short probably isn't going to be worthwhile: I get the compressed version coming out at around 115% of the input size. The more regularity there is within individual keys, the better the compression, so treat that figure as a worst-case upper bound.
use IO::Compress::Gzip qw(gzip);
use constant NUM_REPS => 100;

my $num_reps = shift || NUM_REPS();
print "Trying ", $num_reps, " random strings\n";

my ( %lengths, $avg );
for ( 1 .. $num_reps ) {
    # random string of roughly 100-200 bytes
    my $in = join( "", map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );
    my $out;
    gzip \$in => \$out;
    my ( $li, $lo ) = map length, ( $in, $out );
    $lengths{$lo}++;
    my $pct = $lo / $li * 100.0;
    $avg += $pct;
    printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
}
printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

if ( $ENV{SHOW_HIST} ) {
    print "Length distribution\n";
    for my $len ( sort { $a <=> $b } keys %lengths ) {
        print "$len\t$lengths{ $len }\n";
    }
}

exit 0;

__END__
$ perl comptest 5000
Trying 5000 random strings
avg size after compression: 115.864%
$ SHOW=1 perl comptest 5
Trying 5 random strings
104	=>	127	122.115%
173	=>	196	113.295%
171	=>	194	113.450%
145	=>	168	115.862%
190	=>	213	112.105%
avg size after compression: 115.366%
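And as a quick sanity check of the "regularity helps" point, here's a small sketch (the sample strings and sizes are made up purely for illustration) comparing a highly repetitive 150-byte key against 150 random bytes:

use IO::Compress::Gzip qw(gzip);

# hypothetical 150-byte samples: one highly regular, one pure noise
my %sample = (
    regular => "abcde" x 30,
    random  => join( "", map { chr int rand 256 } 1 .. 150 ),
);

for my $name ( sort keys %sample ) {
    my $out;
    gzip \$sample{$name} => \$out;
    printf "%-8s %3d bytes -> %3d bytes (%.1f%%)\n",
        $name, length $sample{$name}, length $out,
        100 * length($out) / length $sample{$name};
}

The repetitive key should shrink well below its original size, while the random one grows past 100%, which is where that worst-case figure above comes from.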
The cake is a lie.