Re: What is MD5 Hashing and Why is it Important?
by thraxil (Prior) on Feb 15, 2002 at 16:04 UTC
|
MD5 (message digest algorithm) or other hashes are one way functions that produce a "fingerprint". essentially, they map something with a lot of bits down to just a few bits (128 in the case of MD5) in such a way that collisions are as rare as possible.
this is useful because you can compare and store these small hashes much more easily than the entire original sequences. in cryptography, one-way hashes are used to verify something without necessarily giving away the original information. eg, unix stores hashes of passwords instead of the passwords themselves. when a user enters their password, the system computes the hash of it and compares it to the hashes listed in /etc/passwd. since you can't run the hash function in reverse, the system knows that the password you entered is the right one. the crypt that unix uses doesn't really reduce the size but is a similar idea. hashes and digests like MD5 are an integral part of digital signatures.
MD5 is just one of many hashing algorithms. it was developed by Ronald Rivest (the R in RSA) in 1991. it's an updated version of MD4. there's also SHA-1 (Secure Hash Algorithm), developed by NIST in conjunction with the NSA and produces 160 bit digest. it's more collision resistant than MD5 and is the most commonly used hash for cryptographic purposes these days. there are others, but these are the important ones.
anders pearson
| [reply] |
|
|
If I recall correctly a well designed hashing algorithm should cause approximately 50% of it bits to change when 1 bit of its input is different.
But perhaps you can tell me, if SHA1 has less collisions than MD5 why do I see MD5 used more often than SHA1? (In non cryptographic areas anyway) Is MD5 more efficient to generate?
Yves / DeMerphq
--
When to use Prototypes?
| [reply] |
|
|
Part of the reason why MD5 is still around is because it's so common. It does have a greater collission risk than SHA1 and this makes it more vulnerable (I'll explain below). However, it is quicker to generate an MD5 digest than SHA1. If you're forced to generate many digests, you'll prefer MD5.
The reason why these hashing algorithms are so slow is because they were designed to be slow. Consider what happens when a cracker gets your /etc/passwd file (assuming you don't use /etc/shadow). Each entry will have the password hashed and that will resemble the following:
$1$1PUXLuZE$P.LfclRO9SKqTf2BQK.yD1
The 1PUXLuZE is the salt. With a crack program, you use the salt with a list of likely passwords to try to recreate LfclRO9SKqTf2BQK.yD1. If you do, you have the password. If there is a collission (more than one password will generate that string), then security is tremendously weakened.
Now, if most users have a password like F&832*,--?, those probably aren't going to get cracker. However, someone is going to violate your password policy and fail to understand how p4$$w0rd1 was cracked. If the cracker is running crack, though, they could easily run the program for a week before getting to the insecure password. But, if you have collissions, this time could be reduced significantly. SHA1 avoids this vulnerability and also takes longer to compute.
As of a month and a half ago, I didn't know any of this. I only learned when I asked for feedback on my CGI course and mdillon replied with this node.
Cheers,
Ovid
Update: Read the follow-ups to this post!
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats. | [reply] [d/l] |
|
|
|
|
|
|
basically. it's much faster to compute the 128 bit digest for md5 than the 160 bit digest for SHA-1. since MD5 is good enough for non-cryptographic purposes, the speed advantage makes it a better choice in most cases.
i don't remember exactly what the weakness was in MD5 but at some point they found that with certain kinds of input MD5 was more likely to produce collisions than it should. the situations that produced this behaviour are pretty rare normally but could be taken advantage of to weaken it in cryptographic applications.
if you want to know more about the specifics of the weakness, read these:
M.J.B. Robshaw, On Recent Results for MD2, MD4 and MD5, RSA Laboratories Bulletin 4 (pdf) (November 1996).
B. den Boer and A. Bosselaers, Collisions for the compression function of MD5, Advances in Cryptology - Eurocrypt '93, Springer-Verlag (1994), 293-304.
H. Dobbertin, The Status of MD5 After a Recent Attack, CryptoBytes (2) 2 (Summer 1996).
anders pearson
| [reply] |
Re: What is MD5 Hashing and Why is it Important?
by thraxil (Prior) on Feb 15, 2002 at 21:19 UTC
|
i think i inadvertantly steered the discussion almost entirely towards cryptography before so i'm going to try to make up for that by giving an example of how MD5 hashing would be used that doesn't involve cryptography.
say you have a directory with a few thousand files in it that came from various sources but it's quite likely that some are duplicates, just named differently (images or mp3s downloaded from gnutella or something). you want to get rid of the duplicates. so you need some way of uniquely identifying the contents of each file and checking them all against each other. the naive approach is to just loop over all the files in the directory comparing the first one to every other file bit for bit, then compare the second one to all the other files bit for bit.
a faster (but still fairly naive) approach is to look at the exact sizes of the files and then only compare the ones that are the same size. better, but not so good if you're dealing with small files or files that for some reason are frequently the same size.
the way it should be done is to go through the directory calculating the MD5 hash of each file and inserting that as a key into a %seen hash. before you actually insert it, you check if that key has already been entered into the hash; if it has, you know you've got a duplicate. the MD5 hash is only 16 bytes so you can do this on quite a few files without chewing up all your memory. the slowest part of this algorithm is probably going to be the step where you read in each file once before calculating the hash. you're going to have to do this for any algorithm that finds duplicates so that's ok. after that it's going to be pretty fast and robust. if you're dealing with millions or billions of files you'd just change the in memory hash to a tied hash on disk (berkely DB or something).
AxKit probably uses MD5 hashes to see if files it has cached have changed since they were cached. instead of comparing the file in its cache to the original everytime it's served, it just stores the MD5 hash of the (unprocessed) cached version and compares that to an MD5 of the original. if they're the same, it can serve the file from the cache; otherwise it re-processes the original. storing and checking timestamps would get you most of the same efficiency but might be more prone to errors if the system clock changes.
anders pearson
| [reply] |
|
|
Here are several related uses.
Inline uses hashing to know when it has to recompile and relink libraries, versus when it has already done so and can use the old version. This means that you only pay for compilation once.
Rsync uses it to identify when and where files have changed without having to send the whole file across the wire. Instead it sends md5 hashes back and forth, and when they differ it starts sending hashes of part of the file doing a binary search to generate a diff which it can then use to transfer a patch with which it updates the file.
In a website you might use it to generate a handle representing a complex data structure so that you can store the user information locally and just pass a small amount of information back and forth to the client. An example of why you would want to do this is when you have a gif whose display requires a large amount of form data to put together. You can't pass all of the information in the URL for the gif. You can pass the appropriate MD5 hash.
| [reply] |
Re: What is MD5 Hashing and Why is it Important?
by mpeppler (Vicar) on Feb 15, 2002 at 16:32 UTC
|
MD5 hashing is also used to verify the autheticity of a piece of data without having to actually encrypt the data. You do that computing the MD5 hash of the data with an additional shared secret element. This works pretty well.
Michael | [reply] |
|
|
| [reply] |
Re: What is MD5 Hashing and Why is it Important?
by cacharbe (Curate) on Feb 15, 2002 at 18:44 UTC
|
Allow me to point you at one of the best textbooks regarding Cryptography - aside from Schneier's work (and that I mentioned in this node) broken down by chapter in pdf.
You'll want to read Chapter 9. The intro (9.1) should describe hashing as far as you need it, and the rest of the chapter talks about methods and specific algorithms.
C-.
| [reply] |
Re: What is MD5 Hashing and Why is it Important?
by jlongino (Parson) on Feb 15, 2002 at 21:27 UTC
|
A simple query to Google on "MD5 hashing" returned many
results but
one,
close to the top immediately caught my attention due to the
site's name (www.w3.org). That link and the ones within it
pretty much answers your questions.
--Jim | [reply] |
Re: What is MD5 Hashing and Why is it Important?
by theorbtwo (Prior) on Feb 16, 2002 at 06:02 UTC
|
BTW, in order to get /anything/ installed using CPAN, it'll recommend you get Digest::MD5. The reason is that CPAN keeps the md5sums of every module around in order to be able to verify that your download was acatualy successful. (If there were download errors, the md5sum would be different.)
However, PGP in specific probably wants Digest::MD5 because PGP uses MD5 digests for some of it's work. (SHA-1 would be more secure, but PGP is older then SHA-1.)
We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book
| [reply] |