Re: Using MD5 and the theory behind it

I'll give you an example from my own real-world use. I maintain a database of MP3s that I use for broadcasting. Every 2 hours a Perl script (loadmusic) runs through my MP3s looking for new, updated, moved and deleted files.

To determine whether an MP3 is the same, I use an MD5 checksum of the file. That way I can apply the following logic:

Same MD5, same directory + filename: the file hasn't changed
Same MD5, different directory + filename: the file is in the database, but has just moved to another location (no need to re-add it)
Different MD5, same directory + filename: the file has been updated (the ID3 tags might have changed, or the file was still incomplete when it was last scanned, e.g. a Napster download in progress)
MD5 doesn't exist in database and filename + directory doesn't exist in database: new file!
MD5 in the database matches no file on disk and the recorded directory + filename no longer exists: the file has been deleted! Remove it from the DB
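
The five cases above can be sketched as follows. This is only an illustration, written in Python rather than Perl; the actual loadmusic script is not shown in this post, so the names file_md5 and classify, and the idea of holding both sides as path-to-MD5 mappings, are assumptions made for the example. It also assumes no two files share the same content, so a duplicate copy would be reported as a move.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(db, disk):
    """Compare two {path: md5} mappings (hypothetical layout).

    db   -- what the database recorded at the last scan
    disk -- what the current scan found on the hard disk
    Returns a dict with the keys unchanged, moved, updated, new, deleted.
    """
    db_by_md5 = {md5: path for path, md5 in db.items()}
    disk_md5s = set(disk.values())
    result = {"unchanged": [], "moved": [], "updated": [], "new": [], "deleted": []}

    for path, md5 in disk.items():
        if db.get(path) == md5:
            # Same MD5, same path.
            result["unchanged"].append(path)
        elif path in db:
            # Different MD5, same path.
            result["updated"].append(path)
        elif md5 in db_by_md5:
            # Same MD5, different path: record (old path, new path).
            result["moved"].append((db_by_md5[md5], path))
        else:
            # Neither the MD5 nor the path is known.
            result["new"].append(path)

    for path, md5 in db.items():
        # A record whose path and MD5 are both gone from the disk.
        if path not in disk and md5 not in disk_md5s:
            result["deleted"].append(path)

    return result
```

In use, disk would be built by calling file_md5 on every file found during the scan, and db would be read from the database before comparing.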

So, I use it to "link" files on the HD to entries in the database. Since the MD5 sum is unique for every file, it works as the perfect identifier (ed.).

In response to ichimunki: Absolutely correct! Of course what I meant to say was "virtually unique" :)

Re: Re: Using MD5 and the theory behind it by ichimunki (Priest) on Jan 10, 2001 at 23:22 UTC
Although I'm certain that this approach works, and will continue to work, MD5 sums are not unique for every file. If they were, this would be the ultimate compression algorithm (that is, if the MD5 were unique, you could use it to reverse engineer the file using only the hash, because each hash would have only one possible antecedent). The odds of two similar files having the same MD5 sum, however, are very low.
Re: Re: Re: Using MD5 and the theory behind it by saucepan (Scribe) on Jan 11, 2001 at 02:59 UTC
Using one of these approximations, it looks like the probability of a birthday collision will finally hit 0.5 by about the time mr.nick has processed his 22 million million millionth MP3, so I'd agree that he has nothing to worry about for now. ;)
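
For anyone who wants to check that figure, here is a rough sketch in Python. It assumes the approximation meant is the usual exponential one, p ≈ 1 - exp(-n(n-1)/2N) with N = 2^128 possible MD5 values, and that MD5 outputs behave like uniformly random values; both are assumptions made for the example, and the function names are made up. It covers accidental collisions only and says nothing about deliberately constructed ones.

```python
import math


def collision_probability(n, bits=128):
    """Approximate chance of at least one collision among n random hashes."""
    space = 2.0 ** bits
    # 1 - exp(-x), computed via expm1 so tiny probabilities are not rounded to 0.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


def items_for_probability(p, bits=128):
    """Approximate number of hashes needed to reach collision probability p."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    # Number of files at which a collision becomes a coin flip.
    print("%.3g" % items_for_probability(0.5))
    # Collision chance for a collection of one billion files.
    print("%.3g" % collision_probability(1e9))
```

The first number comes out at roughly 2.2 x 10^19, which agrees with the "22 million million million" above.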