Ooh, another constraint; that helps. :) If false positives aren't an issue, perhaps a Bloom filter (see also Bloom::Filter on CPAN)?
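Roughly what that might look like (a minimal, untested sketch; I'm assuming Bloom::Filter's capacity/error_rate constructor arguments and its add/check methods, and the key names are made up, so check the module's docs before relying on this):

use strict;
use warnings;
use Bloom::Filter;

# Size the filter for the expected number of keys and an acceptable
# false-positive rate (both numbers here are only for illustration).
my $filter = Bloom::Filter->new(
    capacity   => 100_000,
    error_rate => 0.001,
);

# Add the keys you've already seen...
$filter->add($_) for qw( key1 key2 key3 );

# ...then membership tests are cheap: a false answer is definitive,
# a true answer means "probably seen" (false positives are possible).
print "maybe seen\n" if $filter->check("key2");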
But gzipping purely random data this short probably isn't going to be worthwhile: the gzip container alone adds roughly 18 bytes of fixed header and trailer, and random bytes are essentially incompressible, so I get the compressed version being around 115% of the input. The more regularity there is within individual keys the better it will do, so take that as a worst-case upper bound.
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

use constant NUM_REPS => 100;

my $num_reps = shift || NUM_REPS();
print "Trying ", $num_reps, " random strings\n";

my %lengths;
my $avg = 0;
for ( 1 .. $num_reps ) {
    # random binary string, roughly 100-200 bytes long
    my $in = join( "",
        map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );
    my $out;
    gzip( \$in => \$out )
        or die "gzip failed: $GzipError\n";
    my ( $li, $lo ) = map length, ( $in, $out );
    $lengths{$lo}++;
    my $pct = $lo / $li * 100.0;
    $avg += $pct;
    printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
}
printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

if ( $ENV{SHOW_HIST} ) {
    print "Length distribution\n";
    for my $len ( sort { $a <=> $b } keys %lengths ) {
        print "$len\t$lengths{ $len }\n";
    }
}

exit 0;
__END__
$ perl comptest 5000
Trying 5000 random strings
avg size after compression: 115.864%
$ SHOW=1 perl comptest 5
Trying 5 random strings
104 => 127 122.115%
173 => 196 113.295%
171 => 194 113.450%
145 => 168 115.862%
190 => 213 112.105%
avg size after compression: 115.366%
The cake is a lie.