in reply to Re^3: Logfile parsing across redundant files
in thread Logfile parsing across redundant files

Which illustrates my question: Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?
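
For concreteness, here's a minimal sketch of the two approaches being weighed: a hypothetical dedup filter that reads log lines from files named on the command line, where only the choice of key differs:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;

    while ( my $line = <> ) {
        # Variant 1: the line itself is the hash key.
        my $key = $line;

        # Variant 2: a 32-character hex digest of the line is the key instead.
        # my $key = md5_hex($line);

        print $line unless $seen{$key}++;
    }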

Where do you want *them* to go today?

Re^5: Logfile parsing across redundant files
by BrowserUk (Patriarch) on Feb 02, 2007 at 07:40 UTC
    Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?

    Why do you think that using an MD5 hash is necessary or useful or preferable?

    Perl's hashes are very well tried and tested. Here are some reasons why I'd do it this way:

    • It's less work.

      There's nothing to do; Perl's hashes simply work.

      The only reason I know for not using them is that they can be memory hungry. With the tiny volumes of data you are talking about, this is not a problem.

    • It's faster.

      Perl's hashing algorithm is way, way faster than MD5; the benchmark sketch after this list shows one way to measure the gap.

    • MD5s are not unique.

      Collisions may be rare, but they are absolutely possible. All algorithms that rely upon the uniqueness of MD5s should incorporate mechanisms to detect those collisions, no matter how rare they are (a sketch of such a check follows this list).

    • Perl's hashes have a collision-handling mechanism built in.

    • Update: Oh. I almost forgot this one. You'd still be using and relying upon Perl's hashes anyway.

      You'd simply be hashing the ASCII or binary representation of the MD5 hash of the entire line. What could that possibly buy you?
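
    To put rough numbers on the speed point above, here is a minimal Benchmark sketch (the log lines are synthetic stand-ins; note that both variants use a Perl hash underneath, the second merely pays for an extra MD5 pass):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use Digest::MD5 qw(md5_hex);

        # Synthetic stand-ins for log lines.
        my @lines = map { "2007-02-02 07:40:00 host$_ message number $_\n" } 1 .. 10_000;

        cmpthese( -3, {
            raw_line => sub {
                my %seen;
                $seen{$_}++ for @lines;
            },
            md5_key => sub {
                my %seen;
                # Still a Perl hash; just keyed on the hex digest.
                $seen{ md5_hex($_) }++ for @lines;
            },
        } );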
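
    And to illustrate the collision point: if you do key on MD5 digests, one sketch of a detection mechanism keeps the original line alongside its digest and verifies on every hit (which, note, hands back all the memory the digest was supposed to save):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        my %by_digest;    # digest => first line seen with that digest

        while ( my $line = <> ) {
            my $key = md5_hex($line);
            if ( exists $by_digest{$key} ) {
                # Same digest: confirm it really is a repeat of the same
                # line; anything else is a genuine MD5 collision.
                warn "MD5 collision detected!\n"
                    if $by_digest{$key} ne $line;
                next;
            }
            $by_digest{$key} = $line;
            print $line;
        }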

    Does that answer your question?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.