isync has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am constructing a simple seen/lookup hash
if(defined($seen{$key})){ next; }else{ $seen{$key} = 1; next; }
to get rid of duplicates. Anyway, as my hash structure will get quite large, I need some tricks to reduce its in-memory footprint. String lengths are around 64-256 characters. My idea was to compress the data prior to looking it up in the %seen hash.

Now, is a gzipped version of a string a good idea as a digest, compared to, let's say, Digest::MD5 or SHA-1? Or will I encounter lots of collisions? Has anyone done that?
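For illustration, a minimal sketch of the idea being asked about - using the compressed bytes of each string as the %seen key. Compress::Zlib's compress() is just one possible choice (it produces plain zlib/deflate output with no timestamp, so identical strings always give identical keys), and the read-from-STDIN loop is only a stand-in for the real program:

use strict;
use warnings;
use Compress::Zlib;                 # provides compress(): deterministic deflate, no header timestamp

my %seen;
while ( my $line = <STDIN> ) {
    chomp $line;
    my $key = compress($line);      # binary compressed form becomes the hash key
    next if $seen{$key}++;          # already seen this string: skip it
    print "$line\n";                # first occurrence only
}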

Re: How good is gzip data as digest?
by ikegami (Patriarch) on Apr 02, 2009 at 17:44 UTC
    I don't have a good answer to your question, but here's a tip:
    if(defined($seen{$key})){ next; }else{ $seen{$key} = 1; next; }
    can be written as
    if(!defined($seen{$key})){ $seen{$key} = 1; } next;
    Or the faster, more concise
    $seen{$key} = 1; next;
    Or the more informative
    ++$seen{$key}; next;
      Or the defined-or operator, more perl5.10ish ;-)
      $seen{$key} //= 1; next;

        Why? That's a step backwards.
      :-)
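      (For reference, a self-contained example of the seen-hash de-duplication idiom the snippets above are circling around, written with grep:)

      use strict;
      use warnings;

      # Classic seen-hash de-duplication: keep only the first occurrence of each string.
      my @strings = qw( foo bar foo baz bar );
      my %seen;
      my @unique = grep { !$seen{$_}++ } @strings;
      print "@unique\n";   # prints: foo bar baz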
Re: How good is gzip data as digest?
by Fletch (Bishop) on Apr 02, 2009 at 17:28 UTC

    Rather than mucking with your keys you could use an on-disk hash (e.g. GDBM_File and friends, or BerkeleyDB) instead.
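    A minimal sketch of the tied on-disk hash Fletch suggests, here with DB_File (a Berkeley DB tie interface); GDBM_File is used much the same way, and the file name is made up:

    use strict;
    use warnings;
    use Fcntl;                       # O_CREAT, O_RDWR
    use DB_File;                     # exports $DB_HASH

    my %seen;
    tie %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    while ( my $line = <STDIN> ) {
        chomp $line;
        next if $seen{$line}++;      # lookup and store hit the disk file, not RAM
        print "$line\n";
    }

    untie %seen;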


      I'd have to agree with Fletch. Also, compression algorithms normally produce binary data which would cause issues when used in string context (null characters) leading to, at a minimum, collisions. Also, I don't think MD5 or SHA1 will buy you much savings given the string lengths you're talking about.

      -derby

      Update: As jethro points out ... I was confusing C strings with perl strings ... silly me.

        Except that perl doesn't mind null characters in strings. Don't confuse them with C strings.
        > perl -e '$str="bla\x00bla"; print length($str),$str,"\n";'
        7blabla
        "Also, compression algorithms normally produce binary data which would cause issues when used in string context (null characters) leading to, at a minimum, collisions"
        But - if I get you right - for pure "seen-lookups" that never touch the null characters, and eval{}-ed in, it should work?

        "Also I don't think MD5 or SHA1 will buy you much savings giving the string lengths you're talking about."
        Right. Some strings are even shorter than the 32-character hex MD5 digests. So the idea came up of not blowing the keys up (MD5) but actually shrinking them (gzip).
      I know about tie-ing a hash to disk, but I didn't mention that option in my post because my scenario needs to be fast(!) and I can't afford the usual disk lookup times - thus the hash needs to stay in memory and has to be as small as possible...

        Ooh, another constraint; that helps. :) If false positives aren't an issue perhaps a Bloom filter (see also Bloom::Filter)?
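        A rough sketch of the Bloom filter route; the constructor arguments and their values below are written from memory, so treat them as assumptions and check the Bloom::Filter docs:

        use strict;
        use warnings;
        use Bloom::Filter;               # fixed, small memory footprint; tunable false-positive rate

        # capacity and error_rate values are invented for this sketch
        my $filter = Bloom::Filter->new( capacity => 1_000_000, error_rate => 0.001 );

        while ( my $line = <STDIN> ) {
            chomp $line;
            next if $filter->check($line);   # "probably seen" - may occasionally be a false positive
            $filter->add($line);
            print "$line\n";
        }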

        But zipping purely random data this short probably isn't going to be worthwhile (I get the compressed version coming out at around 115% of the input; the more regularity there is within individual keys, the better it will do, so take that as a worst-case upper bound).

        use IO::Compress::Gzip qw(gzip);

        use constant NUM_REPS => 100;

        my $num_reps = shift || NUM_REPS();
        print "Trying ", $num_reps, " random strings\n";

        my %lengths;
        for ( 1 .. $num_reps ) {
            my $in = join( "", map { chr( int rand(256) ) } 0 .. ( rand(100) + 100 ) );
            gzip \$in => \$out;
            my ( $li, $lo ) = map length, ( $in, $out );
            $lengths{$lo}++;
            $pct = $lo / $li * 100.0;
            $avg += $pct;
            printf "%d\t=>\t%d\t%4.3f%%\n", $li, $lo, $pct if $ENV{SHOW};
        }
        printf "avg size after compression: %4.3f%%\n", $avg / $num_reps;

        if ( $ENV{SHOW_HIST} ) {
            print "Length distribution\n";
            for my $len ( sort { $a <=> $b } keys %lengths ) {
                print "$len\t$lengths{ $len }\n";
            }
        }

        exit 0;

        __END__
        $ perl comptest 5000
        Trying 5000 random strings
        avg size after compression: 115.864%

        $ SHOW=1 perl comptest 5
        Trying 5 random strings
        104     =>      127     122.115%
        173     =>      196     113.295%
        171     =>      194     113.450%
        145     =>      168     115.862%
        190     =>      213     112.105%
        avg size after compression: 115.366%


        isync:

        Okay, then you could always amortize the disk lookups by using a fragment of the digest value as the in-memory hash key, to keep the size small; then you'd only need to go to disk when you hit a collision. It's yet another level of crunching, but it might save you enough RAM and speed to get the performance you need.
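        A sketch of what that amortizing could look like; the 4-byte fragment size, the DB_File store and its file name are all illustrative choices, not roboticus's actual code:

        use strict;
        use warnings;
        use Fcntl;
        use Digest::MD5 qw(md5);
        use DB_File;

        my %on_disk;                              # full digests live on disk
        tie %on_disk, 'DB_File', 'digests.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie digests.db: $!";

        my %in_ram;                               # keyed by a short fragment of the digest

        while ( my $line = <STDIN> ) {
            chomp $line;
            my $digest   = md5($line);            # 16-byte binary digest
            my $fragment = substr $digest, 0, 4;  # only this much is kept in RAM

            if ( $in_ram{$fragment} ) {
                # fragment collision: only now pay for a disk lookup of the full digest
                next if exists $on_disk{$digest};
            }
            $in_ram{$fragment} = 1;
            $on_disk{$digest}  = 1;
            print "$line\n";                      # genuinely new string
        }

        untie %on_disk;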

        HOWEVER: Have you actually measured the performance? It would be a pity to waste all this time thinking about it if the disk-based hash would be, in fact, fast enough to serve the purpose.

        Remember: First make it work, then make it fast...

        roboticus
Re: How good is gzip data as digest?
by jethro (Monsignor) on Apr 02, 2009 at 17:54 UTC

    You won't get any collisions, because compression is obviously a reversible transformation - otherwise gzip wouldn't know which original to unpack back to.

    But if your strings don't have lots of repetition in them, compression won't make them very small. Gzip works best with really long data.

    A digest, on the other hand, might have collisions, but it will get your string down to a fixed length. Collisions are VERY unlikely with a good hash, but the time needed to calculate the digest may eat into some of the speed you hope to gain from a memory-based solution.
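    To get a feel for that last point, here is a sketch of how one might measure the cost of the two key transformations on a single made-up 128-byte string (Benchmark's cmpthese prints a rate comparison table):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use Digest::MD5 qw(md5);
    use IO::Compress::Gzip qw(gzip);

    # a random 128-character string, roughly in the 64-256 range from the question
    my $str = join '', map { chr( 65 + int rand 26 ) } 1 .. 128;

    cmpthese( -2, {
        md5  => sub { my $digest = md5($str) },           # fixed 16-byte result
        gzip => sub { my $out; gzip \$str => \$out },     # compressed result
    } );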

      A "bidirectional transformation" - exactly what I thought! But I did not think about that compressing of short strings might not buy me much! So thank you for that!
      In regards to using a digest: I think I will have to benchmark some possible solutions (sigh) and measure how well compression on my type/length of strings actually is. I will post an update with my results if I find the time to do that... Anyway, thank you all for the quick thoughts!

      Update:
      It seems my strings reach break-even at lengths of around 120. Strings shorter than that actually compress to something longer than the original (exactly what Fletch observed). I think I will have to look somewhere else for a gain...
        isync:

        If you're currently using an ASCII version of an MD5 digest, you could easily compress 1/3 of the space out by converting it to binary and then using BASE64 to convert to a string. (Or, even better, do it directly.) The ASCII version uses a byte to hold 4 bits of information, while BASE64 will store 6 bits of info in a byte. If you don't mind using a binary string, that would let you compress to 50% of the original size.
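        For scale, Digest::MD5 already provides all three encodings described above (the example string is arbitrary; the digest sizes are what matter):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5 md5_hex md5_base64);

        my $str = 'an example string';
        printf "hex    : %d bytes\n", length md5_hex($str);     # 32
        printf "base64 : %d bytes\n", length md5_base64($str);  # 22
        printf "binary : %d bytes\n", length md5($str);         # 16

        Going from 32 to 22 bytes is the roughly one-third saving, and 32 to 16 bytes is the 50% saving mentioned above.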

        Other than that, I wouldn't expect much. Compression routines (at least the ones I'm familiar with) simply squash repetitive patterns out of the data. A digest tends to spread values evenly through the result space, so there shouldn't be much repetition for a compression algorithm to take advantage of.

        ...roboticus
Re: How good is gzip data as digest?
by BrowserUk (Patriarch) on Apr 02, 2009 at 21:58 UTC

    If you use the binary MD5 as your keys, you can expect around a 33% saving in memory, assuming ASCII keys - maybe more if they're UTF-8.

    Ignore the "two CRCs" version. I'm not sure about the reliability of the collisions and it doesn't buy you a lot of space. The below is using a 64-bit Perl, so YMWV if you're using 32-bit.

    #! perl -sw
    use strict;
    use 5.010;

    use Digest::MD5 qw[ md5 ];
    use String::CRC32;
    use Devel::Size qw[ total_size ];

    open IN, '<', 'randStr-1M(64-254).dat' or die $!;

    my %asc;
    chomp, ++$asc{ $_ } while <IN>;
    printf "%07d Ascii keys: %.f\n", scalar keys( %asc ), total_size( \%asc );
    undef %asc;

    seek IN, 0, 0;
    my %md5;
    chomp, undef( $md5{ md5( $_ ) } ) while <IN>;
    printf "%07d binary MD5 keys: %.f\n", scalar keys( %md5 ), total_size( \%md5 );
    undef %md5;

    seek IN, 0, 0;
    my %crc;
    chomp, undef( $crc{ pack 'VV', crc32( $_ ), crc32( scalar reverse $_ ) } ) while <IN>;
    printf "%07d binary CRC keys: %.f\n", scalar keys( %crc ), total_size( \%crc );

    __END__
    c:\test>bigHash.pl
    1000000 Ascii keys: 53879053
    1000000 binary MD5 keys: 35766510
    1000000 binary CRC keys: 34756892
