in reply to Re^3: Logfile parsing across redundant files
in thread Logfile parsing across redundant files

Which illustrates my question: Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?
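
For concreteness, here's a minimal sketch of the two approaches being weighed: a hypothetical dedup filter that reads log lines from files named on the command line, where only the choice of key differs:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;

    while ( my $line = <> ) {
        # Variant 1: the line itself is the hash key.
        my $key = $line;

        # Variant 2: a 32-character hex digest of the line is the key instead.
        # my $key = md5_hex($line);

        print $line unless $seen{$key}++;
    }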

Where do you want *them* to go today?

Re^5: Logfile parsing across redundant files
by BrowserUk (Patriarch) on Feb 02, 2007 at 07:40 UTC
    Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?

    Why do you think that using an MD5 hash is necessary or useful or preferable?

    Perl's hashes are very well tried and tested. Here are some reasons why I'd do it this way:

    • It's less work.

      There's nothing to do; Perl's hashes simply work.

      The only reason I know for not using them is that they can be memory hungry. With the tiny volumes of data you are talking about, this is not a problem.

    • It's faster.

      Perl's hashing algorithm is way, way faster than MD5; the benchmark sketch after this list shows one way to measure the gap.

    • MD5s are not unique.

      Collisions may be rare, but they are absolutely possible. All algorithms that rely upon the uniqueness of MD5s should incorporate mechanisms to detect those collisions, no matter how rare they are (a sketch of such a check follows this list).

    • Perl's hashes have a collision-handling mechanism built in.

    • Update: Oh. I almost forgot this one. You'd still be using and relying upon Perl's hashes anyway.

      You'd simply be hashing the ASCII or binary representation of the MD5 hash of the entire line. What could that possibly buy you?
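
    To put rough numbers on the speed point above, here is a minimal Benchmark sketch (the log lines are synthetic stand-ins; note that both variants use a Perl hash underneath, the second merely pays for an extra MD5 pass):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use Digest::MD5 qw(md5_hex);

        # Synthetic stand-ins for log lines.
        my @lines = map { "2007-02-02 07:40:00 host$_ message number $_\n" } 1 .. 10_000;

        cmpthese( -3, {
            raw_line => sub {
                my %seen;
                $seen{$_}++ for @lines;
            },
            md5_key => sub {
                my %seen;
                # Still a Perl hash; just keyed on the hex digest.
                $seen{ md5_hex($_) }++ for @lines;
            },
        } );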
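
    And to illustrate the collision point: if you do key on MD5 digests, one sketch of a detection mechanism keeps the original line alongside its digest and verifies on every hit (which, note, hands back all the memory the digest was supposed to save):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        my %by_digest;    # digest => first line seen with that digest

        while ( my $line = <> ) {
            my $key = md5_hex($line);
            if ( exists $by_digest{$key} ) {
                # Same digest: confirm it really is a repeat of the same
                # line; anything else is a genuine MD5 collision.
                warn "MD5 collision detected!\n"
                    if $by_digest{$key} ne $line;
                next;
            }
            $by_digest{$key} = $line;
            print $line;
        }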

    Does that answer your question?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.