in reply to How good is gzip data as digest?

Rather than mucking with your keys you could use an on-disk hash (e.g. GDBM_File and friends, or BerkeleyDB) instead.
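
For the record, a minimal sketch of the tied approach using GDBM_File's documented interface (the filename and the seen-filter loop are placeholder illustration, not tested against your data):

    use strict;
    use warnings;
    use GDBM_File;

    # Keep the "seen" hash on disk instead of in RAM.
    tie my %seen, 'GDBM_File', 'seen.gdbm', &GDBM_WRCREAT, 0640
        or die "Cannot tie seen.gdbm: $!";

    while ( my $key = <STDIN> ) {
        chomp $key;
        next if exists $seen{$key};   # already processed, skip
        $seen{$key} = 1;              # remember for next time
        # ... handle the new key here ...
    }

    untie %seen;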

The cake is a lie.

Re^2: How good is gzip data as digest?
by derby (Abbot) on Apr 02, 2009 at 17:42 UTC

    I'd have to agree with Fletch. Also, compression algorithms normally produce binary data which would cause issues when used in string context (null characters) leading to, at a minimum, collisions. Also, I don't think MD5 or SHA1 will buy you much savings given the string lengths you're talking about.

    -derby

    Update: As jethro points out ... I was confusing C strings with perl strings ... silly me.

      Except that perl doesn't mind null characters in strings. Don't confuse them with C strings.
      > perl -e '$str="bla\x00bla"; print length($str),$str,"\n";'
      7blabla
      "Also, compression algorithms normally produce binary data which would cause issues when used in string context (null characters) leading to, at a minimum, collisions"
      But - if I get you right - for "seen" lookups that never touch the null characters directly, with the data just eval{}-ed in, it should work?

      "Also I don't think MD5 or SHA1 will buy you much savings giving the string lengths you're talking about."
      Right. Some strings are even shorter than the 32bit MD5's. So the idea of not blowing up (MD5) but even compressing down (gzip) came up.
        32bit MD5's

        128-bit, actually.
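
        To put numbers on it: the raw MD5 digest is 16 bytes (128 bits), and the usual hex form is 32 characters. A quick check with Digest::MD5 (the example key is made up):

            use Digest::MD5 qw(md5 md5_hex);

            my $key = "http://example.org/some/fairly/long/url?with=parameters";
            printf "key     : %d bytes\n", length $key;
            printf "md5 raw : %d bytes\n", length md5($key);       # always 16
            printf "md5 hex : %d bytes\n", length md5_hex($key);   # always 32

        So storing the raw digest only saves space for keys longer than 16 bytes (32 for the hex form).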

Re^2: How good is gzip data as digest?
by isync (Hermit) on Apr 02, 2009 at 17:45 UTC
    I know about tie-ing a disk-based hash, but I kept quiet about that option in my post because my scenario needs to be fast(!) and I can't afford the usual disk lookup times - thus the hash needs to stay in memory and has to be as small as possible...

      Ooh, another constraint; that helps. :) If false positives aren't an issue, perhaps a Bloom filter (see also Bloom::Filter)?
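
      Roughly along the lines of the Bloom::Filter synopsis (untested sketch; the capacity and error rate are placeholders you'd have to tune for your data):

          use Bloom::Filter;

          # Size the filter for the expected number of keys and an acceptable
          # false-positive rate; memory use depends on both.
          my $seen = Bloom::Filter->new( capacity => 1_000_000, error_rate => 0.001 );

          while ( my $key = <STDIN> ) {
              chomp $key;
              next if $seen->check($key);   # "probably seen" (may be a false positive)
              $seen->add($key);
              # ... handle the new key here ...
          }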

      But zipping purely random data this short probably isn't going to be worthwhile: I get compressed output at around 115% of the input size. The more regularity there is within individual keys, the better it will do, so take that as a worst-case upper bound.

      use strict;
      use warnings;
      use IO::Compress::Gzip qw(gzip);

      use constant NUM_REPS => 100;

      my $num_reps = shift || NUM_REPS();
      print "Trying ", $num_reps, " random strings\n";

      my %lengths;
      my ( $out, $pct );
      my $avg = 0;

      for ( 1 .. $num_reps ) {
          # random binary string, roughly 100-200 bytes long
          my $in = join( "", map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );
          gzip \$in => \$out;
          my ( $li, $lo ) = map length, ( $in, $out );
          $lengths{$lo}++;
          $pct = $lo / $li * 100.0;
          $avg += $pct;
          printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
      }

      printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

      if ( $ENV{SHOW_HIST} ) {
          print "Length distribution\n";
          for my $len ( sort { $a <=> $b } keys %lengths ) {
              print "$len\t$lengths{ $len }\n";
          }
      }

      exit 0;

      __END__
      $ perl comptest 5000
      Trying 5000 random strings
      avg size after compression: 115.864%
      $ SHOW=1 perl comptest 5
      Trying 5 random strings
      104     =>      127     122.115%
      173     =>      196     113.295%
      171     =>      194     113.450%
      145     =>      168     115.862%
      190     =>      213     112.105%
      avg size after compression: 115.366%

      The cake is a lie.

      isync:

      Okay, then you could always amortize the disk lookups by using a fragment of the digest value as the in-memory hash key to keep its size small; then you'd only need to go to disk when you have a collision. It's yet another level of crunching, but it might save you enough RAM and time to get the performance you need.
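
      Something like this, roughly (the fragment length, the GDBM_File choice, and the already_seen name are just illustration):

          use strict;
          use warnings;
          use Digest::MD5 qw(md5);
          use GDBM_File;

          # RAM holds only a short digest fragment per key; the full record
          # lives in the tied on-disk hash.
          my %maybe_seen;
          tie my %seen_on_disk, 'GDBM_File', 'seen.gdbm', &GDBM_WRCREAT, 0640
              or die "Cannot tie seen.gdbm: $!";

          sub already_seen {
              my ($key) = @_;
              my $frag  = substr( md5($key), 0, 6 );    # 6 of the digest's 16 bytes

              if ( !$maybe_seen{$frag}++ ) {
                  # Fragment never seen => key is certainly new; no disk read needed.
                  $seen_on_disk{$key} = 1;
                  return 0;
              }

              # Fragment seen before => either this key or a collision; ask the disk.
              return 1 if exists $seen_on_disk{$key};
              $seen_on_disk{$key} = 1;
              return 0;
          }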

      HOWEVER: Have you actually measured the performance? It would be a pity to waste all this time thinking about it if the disk-based hash would be, in fact, fast enough to serve the purpose.
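
      A measurement could be as crude as the sketch below, using the core Benchmark module (the key set and the GDBM_File choice are arbitrary stand-ins for your real data):

          use strict;
          use warnings;
          use Benchmark qw(cmpthese);
          use GDBM_File;

          # 100_000 made-up keys of varying length
          my @keys = map { "key_$_" . ( "x" x ( $_ % 100 ) ) } 1 .. 100_000;

          my %in_memory;
          tie my %on_disk, 'GDBM_File', 'bench.gdbm', &GDBM_WRCREAT, 0640
              or die "Cannot tie bench.gdbm: $!";

          $in_memory{$_} = 1 for @keys;
          $on_disk{$_}   = 1 for @keys;

          # Compare pure lookup speed; each variant runs for about 3 CPU seconds.
          cmpthese( -3, {
              memory => sub { my $hits = grep { exists $in_memory{$_} } @keys },
              gdbm   => sub { my $hits = grep { exists $on_disk{$_} } @keys },
          } );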

      Remember: First make it work, then make it fast...

      roboticus
        Actually I did measure performance, and the difference is significant. But I like your idea of hash fractions - it will go into the next iteration of my lookup-hash algorithm for a test drive.