in reply to How good is gzip data as digest?

You won't get any collisions, because packing is necessarily a bidirectional transformation; otherwise gzip wouldn't know which original it should unpack to.

But if your strings don't have much repetition in them, packing won't make them very small. Gzip works best on really long data.

A digest, on the other hand, might have collisions, but it will get your string down to a fixed length. Collisions are VERY unlikely with a good hash, but the time to calculate the hash may eat into the speed you hope to get out of a memory-based solution.
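
To make the trade-off concrete, here is a minimal sketch using the core IO::Compress::Gzip and Digest::MD5 modules (the sample string is illustrative, nothing more):

    use strict;
    use warnings;
    use IO::Compress::Gzip qw(gzip $GzipError);
    use Digest::MD5 qw(md5);

    my $str = 'a fairly short string without much repetition';

    gzip \$str => \my $packed
        or die "gzip failed: $GzipError";

    printf "original: %d bytes\n", length $str;      # 46 bytes
    printf "gzipped:  %d bytes\n", length $packed;   # likely longer: gzip adds header/trailer overhead
    printf "md5:      %d bytes\n", length md5($str); # always 16 bytes, whatever the input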

Re^2: How good is gzip data as digest?
by isync (Hermit) on Apr 02, 2009 at 18:05 UTC
    A "bidirectional transformation" - exactly what I thought! But I hadn't considered that compressing short strings might not buy me much, so thank you for that!
    Regarding a digest: I think I will have to benchmark some possible solutions (sigh) and measure how well compression actually performs on strings of my type and length. I will post an update with my results if I find the time. Anyway, thank you all for the quick thoughts!

    Update:
    It seems my strings reach break-even at lengths of around 120. Strings shorter than that actually compress to something longer than the original. (Exactly what Fletch observed.) I think I will have to look elsewhere for a gain...
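
    A rough sketch of the kind of check involved (the repeated-path sample string is made up; real numbers depend on your data):

        use strict;
        use warnings;
        use IO::Compress::Gzip qw(gzip $GzipError);

        # Compare gzipped vs. original size at several lengths; break-even
        # is where the gzipped size first drops below the original.
        for my $len (40, 80, 120, 160, 200) {
            my $str = substr('/some/path/with/mild-repetition/' x 10, 0, $len);
            gzip \$str => \my $packed or die "gzip failed: $GzipError";
            printf "len %3d -> gzipped %3d bytes\n", $len, length $packed;
        }
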
      isync:

      If you're currently using the ASCII (hex) version of an MD5 digest, you could easily save about a third of the space by converting it to binary and then using Base64 to turn it back into a string. (Or, even better, produce the Base64 form directly.) The hex version uses a byte to hold 4 bits of information, while Base64 stores 6 bits of information per byte. If you don't mind using a binary string, that would let you shrink the digest to 50% of its original size.
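
      A quick illustration: Digest::MD5 happens to provide all three forms directly ('some key' is just a placeholder input):

          use strict;
          use warnings;
          use Digest::MD5 qw(md5 md5_hex md5_base64);

          my $data = 'some key';
          printf "hex:    %2d chars\n", length md5_hex($data);    # 32 chars (4 bits per byte)
          printf "base64: %2d chars\n", length md5_base64($data); # 22 chars (6 bits per byte)
          printf "binary: %2d bytes\n", length md5($data);        # 16 bytes (8 bits per byte)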

      Other than that, I wouldn't expect much. Compression routines (at least the ones I'm familiar with) simply squash repetitive patterns out of the data. A digest tends to spread values evenly through the result space, so there shouldn't be much repetition for a compression algorithm to take advantage of.
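
      As a toy demonstration of that point, gzipping a digest only adds overhead:

          use strict;
          use warnings;
          use Digest::MD5 qw(md5);
          use IO::Compress::Gzip qw(gzip $GzipError);

          my $digest = md5('anything');    # 16 bytes, statistically close to random
          gzip \$digest => \my $packed or die "gzip failed: $GzipError";
          printf "digest %d bytes, gzipped %d bytes\n",
              length $digest, length $packed;    # gzipped result comes out larger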

      ...roboticus