in reply to What is MD5 Hashing and Why is it Important?
i think i inadvertantly steered the discussion almost entirely towards cryptography before so i'm going to try to make up for that by giving an example of how MD5 hashing would be used that doesn't involve cryptography.
say you have a directory with a few thousand files in it that came from various sources but it's quite likely that some are duplicates, just named differently (images or mp3s downloaded from gnutella or something). you want to get rid of the duplicates. so you need some way of uniquely identifying the contents of each file and checking them all against each other. the naive approach is to just loop over all the files in the directory comparing the first one to every other file bit for bit, then compare the second one to all the other files bit for bit.
a faster (but still fairly naive) approach is to look at the exact sizes of the files and then only compare the ones that are the same size. better, but not so good if you're dealing with small files or files that for some reason are frequently the same size.
the way it should be done is to go through the directory calculating the MD5 hash of each file and inserting that as a key into a %seen hash. before you actually insert it, you check if that key has already been entered into the hash; if it has, you know you've got a duplicate. the MD5 hash is only 16 bytes so you can do this on quite a few files without chewing up all your memory. the slowest part of this algorithm is probably going to be the step where you read in each file once before calculating the hash. you're going to have to do this for any algorithm that finds duplicates so that's ok. after that it's going to be pretty fast and robust. if you're dealing with millions or billions of files you'd just change the in memory hash to a tied hash on disk (berkely DB or something).
AxKit probably uses MD5 hashes to see if files it has cached have changed since they were cached. instead of comparing the file in its cache to the original everytime it's served, it just stores the MD5 hash of the (unprocessed) cached version and compares that to an MD5 of the original. if they're the same, it can serve the file from the cache; otherwise it re-processes the original. storing and checking timestamps would get you most of the same efficiency but might be more prone to errors if the system clock changes.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re (tilly) 2: What is MD5 Hashing and Why is it Important?
by tilly (Archbishop) on Feb 15, 2002 at 22:21 UTC |