Re: What is MD5 Hashing and Why is it Important?

i think i inadvertantly steered the discussion almost entirely towards cryptography before so i'm going to try to make up for that by giving an example of how MD5 hashing would be used that doesn't involve cryptography.

say you have a directory with a few thousand files in it that came from various sources but it's quite likely that some are duplicates, just named differently (images or mp3s downloaded from gnutella or something). you want to get rid of the duplicates. so you need some way of uniquely identifying the contents of each file and checking them all against each other. the naive approach is to just loop over all the files in the directory comparing the first one to every other file bit for bit, then compare the second one to all the other files bit for bit.

a faster (but still fairly naive) approach is to look at the exact sizes of the files and then only compare the ones that are the same size. better, but not so good if you're dealing with small files or files that for some reason are frequently the same size.

the way it should be done is to go through the directory calculating the MD5 hash of each file and inserting that as a key into a %seen hash. before you actually insert it, you check if that key has already been entered into the hash; if it has, you know you've got a duplicate. the MD5 hash is only 16 bytes so you can do this on quite a few files without chewing up all your memory. the slowest part of this algorithm is probably going to be the step where you read in each file once before calculating the hash. you're going to have to do this for any algorithm that finds duplicates so that's ok. after that it's going to be pretty fast and robust. if you're dealing with millions or billions of files you'd just change the in memory hash to a tied hash on disk (berkely DB or something).

AxKit probably uses MD5 hashes to see if files it has cached have changed since they were cached. instead of comparing the file in its cache to the original everytime it's served, it just stores the MD5 hash of the (unprocessed) cached version and compares that to an MD5 of the original. if they're the same, it can serve the file from the cache; otherwise it re-processes the original. storing and checking timestamps would get you most of the same efficiency but might be more prone to errors if the system clock changes.

anders pearson

Comment on Re: What is MD5 Hashing and Why is it Important?

Replies are listed 'Best First'.
Re (tilly) 2: What is MD5 Hashing and Why is it Important? by tilly (Archbishop) on Feb 15, 2002 at 22:21 UTC
Here are several related uses. Inline uses hashing to know when it has to recompile and relink libraries, versus when it has already done so and can use the old version. This means that you only pay for compilation once. Rsync uses it to identify when and where files have changed without having to send the whole file across the wire. Instead it sends md5 hashes back and forth, and when they differ it starts sending hashes of part of the file doing a binary search to generate a diff which it can then use to transfer a patch with which it updates the file. In a website you might use it to generate a handle representing a complex data structure so that you can store the user information locally and just pass a small amount of information back and forth to the client. An example of why you would want to do this is when you have a gif whose display requires a large amount of form data to put together. You can't pass all of the information in the URL for the gif. You can pass the appropriate MD5 hash.	[reply]

Replies are listed 'Best First'.

Re (tilly) 2: What is MD5 Hashing and Why is it Important?
by tilly (Archbishop) on Feb 15, 2002 at 22:21 UTC

Inline uses hashing to know when it has to recompile and relink libraries, versus when it has already done so and can use the old version. This means that you only pay for compilation once.

Rsync uses it to identify when and where files have changed without having to send the whole file across the wire. Instead it sends md5 hashes back and forth, and when they differ it starts sending hashes of part of the file doing a binary search to generate a diff which it can then use to transfer a patch with which it updates the file.

In a website you might use it to generate a handle representing a complex data structure so that you can store the user information locally and just pass a small amount of information back and forth to the client. An example of why you would want to do this is when you have a gif whose display requires a large amount of form data to put together. You can't pass all of the information in the URL for the gif. You can pass the appropriate MD5 hash.

[reply]