Simple Digests?

pileofrogs has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Simple Digests? by GrandFather (Saint) on Mar 27, 2006 at 23:30 UTC
The question you need to ask is "How unique has the hash key to be?". If the answer is "Absolutely", then you have a problem, you need to store almost as many bits as there are in the message (compression helps though). Otherwise the number of bits of uniqueness dictates what you can do. A simple CRC type checksum of an appropriate number of bits is likely cheaper to calculate than an MD5. DWIM is Perl's answer to Gödel	[reply]
Re: Simple Digests? by swampyankee (Parson) on Mar 27, 2006 at 23:24 UTC
Page 237 of my copy of Programming Perl (2d edition) has this: `undef $/; # That's what's in the book; local $/ may be better. $checksum = unpack("%32C",<>) %32767;` [download] which is claimed to produce "the same number as the System V sum* program". This may be supercheap; it's certainly pretty short. emc " The most likely way for the world to be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents." —Nathaniel S. Borenstein	[reply] [d/l]
Re^2: Simple Digests? by wazoox (Prior) on Mar 29, 2006 at 15:59 UTC
funny, in perldoc -f unpack we have : `For example, the following computes the same number as the Sys- tem V sum program: $checksum = do { local $/; # slurp! unpack("%32C*",<>) % 65535; };` [download] So it gained one more bit in the meanwhile :)	[reply] [d/l]
Re^3: Simple Digests? by swampyankee (Parson) on Mar 29, 2006 at 20:55 UTC
That is funny. Now, I'm wondering why.... emc "Being forced to write comments actually improves code, because it is easier to fix a crock than to explain it. " —G. Steele	[reply]
Re: Simple Digests? by radiantmatrix (Parson) on Mar 28, 2006 at 18:16 UTC
CRC is a Cyclic Redundancy Check - it is a fast, reasonably reliable way to ensure that a message was not damaged in transit. GrandFather makes an excellent point about uniquness. Digests are very good for integrity checking, because the odds are very low that a substantially similar message will result in the same digest. However, you are reducing a message of arbitrary size down to a limited-size digest (16 bytes, for MD5). There will absolutely be collisions, it's a question of when they will occur. Now, if you need to uniquely identify e-mail messages to use as keys, there are a few ways. One is to use an additional "pretty unique" attribute of the E-Mail message, and append that to the hash string. For example, the `Message-ID:` SMTP header should be a unique value anyway, but combined with a digest of the entire message, the chance of collision is essentially zero. For example, if the Message-ID was `<907068073421@smtp.yourhost.com>`, and the Digest of the message was `5eb63bbbe01eeed093cb22bb8f5acdc3` (the MD5 of "hello world", if you care). Your key might be `907068073421@smtp.yourhost.com\|\|5eb63bbbe01eeed093cb22bb8f5acdc3` -- that's pretty likely to be unique! You could also use something like Data::UUID to associate the message with a truly unique identifier. I don't know if this would work for your application, because I don't know your requirements. I'm guessing you wish to be able to derive the key given the message? If so, than Data::UUID won't work for you. <-radiant.matrix-> A collection of thoughts and links from the minds of geeks The Code that can be seen is not the true Code I haven't found a problem yet that can't be solved by a well-placed trebuchet	[reply] [d/l] [select]
Re^2: Simple Digests? by pileofrogs (Priest) on Mar 28, 2006 at 18:54 UTC
Thanks for the response! Actually, I need to be able to use any data as the key. Since the data needs to be storable in standard DB file formats, I need a way to handle relatively long data. One possible example would be an entire email. It could also be, I dunno, a JPEG file...	[reply]
Re^3: Simple Digests? by radiantmatrix (Parson) on Mar 29, 2006 at 15:13 UTC
Ah. That's a tough one. Digests can help, but since two very different sets of data can share a digest... well, it won't be unique. You could do something like a digest prepended with the nth 10 bytes from the file, or something, but even that wouldn't be a guarantee. It comes down to finding something that's unique enough about the data that when it's combined with a digest, you have a "pretty much guaranteed" unique key. MIME type, maybe? Combined with a longer digest (say, SHA-256), that would be pretty good. Two different digests (say, `MD5().SHA256()`) would be a likely candidate, too -- the chance that data "A" will have a digest collision with data "B" in two different digest systems is fanstastically small. <-radiant.matrix-> A collection of thoughts and links from the minds of geeks The Code that can be seen is not the true Code I haven't found a problem yet that can't be solved by a well-placed trebuchet	[reply] [d/l]
Re^4: Simple Digests? by pileofrogs (Priest) on Mar 31, 2006 at 02:20 UTC
Re^3: Simple Digests? by shotgunefx (Parson) on Mar 29, 2006 at 21:04 UTC
Are you retrieving it just by the key? If you're using some other data to get the key, you could use a timestamp prefixed to the md5. Seems like it would be pretty tough to have a collision -Lee "To be civilized is to deny one's nature."	[reply]
Re^4: Simple Digests? by pileofrogs (Priest) on Mar 31, 2006 at 02:21 UTC
Re^5: Simple Digests? by shotgunefx (Parson) on Apr 01, 2006 at 02:02 UTC