in reply to Simple Digests?

CRC stands for Cyclic Redundancy Check: a fast, reasonably reliable way to detect whether a message was accidentally damaged in transit.

GrandFather makes an excellent point about uniqueness. Digests are very good for integrity checking, because the odds are very low that a substantially similar message will result in the same digest. However, you are reducing a message of arbitrary size down to a limited-size digest (16 bytes, for MD5). There will absolutely be collisions; it's only a question of when they will occur.
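
To put a rough number on "when", here is a minimal sketch of the usual birthday approximation, written in Python just for illustration. It assumes the digest behaves like a uniform random function (true only approximately for real digests), and the function name collision_probability is my own invention, not part of any library.

```python
import math

def collision_probability(n_items: int, digest_bits: int) -> float:
    """Approximate chance that at least two of n_items share a digest.

    Birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)).
    Assumes the digest is uniformly distributed over 2^bits values.
    """
    exponent = -n_items * (n_items - 1) / 2.0 ** (digest_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)

# One million items keyed by a 128-bit digest such as MD5
print(collision_probability(1_000_000, 128))
# The same million items keyed by a 32-bit checksum such as CRC-32
print(collision_probability(1_000_000, 32))
```

The first figure is vanishingly small and the second is close to certainty, which is why digest length matters far more than message size for accidental collisions.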

Now, if you need to uniquely identify e-mail messages to use as keys, there are a few ways. One is to take an additional "pretty unique" attribute of the e-mail message and combine it with the hash string. For example, the Message-ID: header should be a unique value anyway, and combined with a digest of the entire message, the chance of collision is essentially zero.

For example, suppose the Message-ID was <907068073421@smtp.yourhost.com> and the digest of the message was 5eb63bbbe01eeed093cb22bb8f5acdc3 (the MD5 of "hello world", if you care). Your key might be 907068073421@smtp.yourhost.com||5eb63bbbe01eeed093cb22bb8f5acdc3 -- that's pretty likely to be unique!
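
A minimal sketch of building such a key, in Python for illustration only. The helper name message_key and the choice to strip the angle brackets are my assumptions; the "||" separator is the one used in the example above.

```python
import hashlib

def message_key(message_id: str, body: bytes) -> str:
    """Join a Message-ID (angle brackets stripped) and the body's MD5 with '||'."""
    digest = hashlib.md5(body).hexdigest()
    return message_id.strip().strip("<>") + "||" + digest

# Hypothetical message whose entire content is "hello world"
key = message_key("<907068073421@smtp.yourhost.com>", b"hello world")
print(key)
```

Because both parts come from the message itself, the same key can be re-derived later from the message alone.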

You could also use something like Data::UUID to associate the message with a truly unique identifier. I don't know if this would work for your application, because I don't know your requirements. I'm guessing you wish to be able to derive the key given the message? If so, then Data::UUID won't work for you.

<-radiant.matrix->
A collection of thoughts and links from the minds of geeks
The Code that can be seen is not the true Code
I haven't found a problem yet that can't be solved by a well-placed trebuchet

Re^2: Simple Digests?
by pileofrogs (Priest) on Mar 28, 2006 at 18:54 UTC

    Thanks for the response!

    Actually, I need to be able to use any data as the key. Since the data needs to be storable in standard DB file formats, I need a way to handle relatively long data. One possible example would be an entire email. It could also be, I dunno, a JPEG file...

      Ah. That's a tough one. Digests can help, but since two very different sets of data can share a digest... well, it won't be unique. You could do something like prepending the digest with a fixed slice of the file (say, the nth 10 bytes), but even that wouldn't be a guarantee.

      It comes down to finding something that's unique enough about the data that when it's combined with a digest, you have a "pretty much guaranteed" unique key. MIME type, maybe? Combined with a longer digest (say, SHA-256), that would be pretty good.

      Two different digests (say, MD5().SHA256()) would be a likely candidate, too -- the chance that data "A" will have a digest collision with data "B" in two different digest systems is fantastically small.
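
      Both ideas can be sketched in a few lines; this is Python purely for illustration, and the names double_digest and typed_key (plus the "||" separator) are hypothetical choices of mine, not an established scheme.

```python
import hashlib

def double_digest(data: bytes) -> str:
    """Concatenate the MD5 and SHA-256 hex digests of the same data."""
    return hashlib.md5(data).hexdigest() + hashlib.sha256(data).hexdigest()

def typed_key(data: bytes, mime_type: str) -> str:
    """Prefix a SHA-256 hex digest with the caller-supplied MIME type."""
    return mime_type + "||" + hashlib.sha256(data).hexdigest()

print(double_digest(b"hello world"))
print(typed_key(b"hello world", "text/plain"))
```

      The double digest is 96 hex characters (32 for MD5 plus 64 for SHA-256), which is still short enough to use as a key in ordinary DB file formats.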

      <-radiant.matrix->

        I really don't need that much protection from collisions. It's not like I need to worry about someone intentionally trying to create a collision. Instead of one-in-a-zillion odds of a collision, I could probably get by with one in a thousand.

      Are you retrieving it just by the key? If you're using some other data to get the key, you could use a timestamp prefixed to the MD5.
      Seems like it would be pretty tough to have a collision that way.
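
      A rough sketch of what I mean, in Python for illustration only. The name timestamped_key, the "-" separator, and whole-second resolution are all my own assumptions.

```python
import hashlib
import time

def timestamped_key(data: bytes, timestamp=None) -> str:
    """Prefix the MD5 hex digest with a Unix timestamp in whole seconds."""
    if timestamp is None:
        timestamp = int(time.time())  # default to "now"
    return f"{int(timestamp)}-{hashlib.md5(data).hexdigest()}"

# Hypothetical fixed timestamp so the key is reproducible
print(timestamped_key(b"hello world", 1143572040))
```

      The catch is that the key can no longer be derived from the data alone, so the timestamp has to be stored or looked up some other way.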

      -Lee
      "To be civilized is to deny one's nature."

        Freaking Brilliant!