in reply to Re^2: Logfile parsing across redundant files
in thread Logfile parsing across redundant files

On the basis of what you've said about the data, it could be as simple as this:

#! perl -slw
use strict;

my $dir = $ARGV[ 0 ] || die 'Need a directory';

## Accumulate each distinct line across all the logs
my %hash;
while( my $file = <"$dir/*.log"> ) {
    open my $fh, '<', $file or die "$file : $!";
    while( <$fh> ) {
        chomp;              ## -l only auto-chomps under -n/-p, so chomp here
        $hash{ $_ } = 1;
    }
    close $fh;
}

## -l sets $\, so print re-appends the newline
open my $out, '>', "$dir/composite.log" or die $!;
print $out $_ for sort keys %hash;
close $out;

This assumes that all 31 log files from a particular server are located in a single directory, that no other .log files are in that directory, and that the lines can be ordered by a plain alphanumeric sort. E.g. each line carries a date/time stamp at the beginning, in some sensible form (YYYYMMDD HH:MM:SS) that sorts correctly as text.
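A minimal usage sketch, assuming you save the above as (the hypothetical) composite.pl and the logs live in /var/log/serverA:

perl composite.pl /var/log/serverA

The merged, deduplicated, sorted output lands in /var/log/serverA/composite.log.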


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^4: Logfile parsing across redundant files
by thezip (Vicar) on Feb 02, 2007 at 07:20 UTC

    Which illustrates my question: Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?

    Where do you want *them* to go today?
      Is using the entire line as the hash key *better* than using an MD5 digest of the line as the hash key?

      Why do you think that using an MD5 hash is necessary, useful, or preferable?

      Perl's hashes are very well tried and tested. Here are some reasons why I'd do it this way:

      • It's less work.

        There's nothing to do, Perl's hashes simply work.

        The only reason I know of for not using them is that they can be memory hungry. With the tiny volumes of data you are talking about, that is not a problem.

      • It's faster.

        Perl's internal hashing algorithm is way, way faster to compute than an MD5 digest; see the benchmark sketch after this list.

      • MD5s are not unique.

        Collisions may be rare, but they are absolutely possible. All algorithms that rely upon the uniqueness of MD5s should incorporate mechanisms to detect those collisions, no matter how rare they are.

      • Perl's hashes have collision handling built-in: when two keys land in the same bucket, the full keys are compared, so distinct lines are never conflated.

      • Update: Oh, I almost forgot this one. You'd still be using and relying upon Perl's hashes anyway.

        You'd simply be hashing the ASCII (or binary) representation of the MD5 digest of the entire line, rather than the line itself. What could that possibly buy you?
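
      For the curious, here's a minimal benchmark sketch of the comparison, using the core Benchmark and Digest::MD5 modules (the synthetic lines and counts are just assumptions for illustration):

      #! perl -slw
      use strict;
      use Benchmark qw( cmpthese );
      use Digest::MD5 qw( md5_hex );

      ## Synthetic, timestamp-ish "log lines" to deduplicate
      my @lines = map {
          sprintf '20070202 07:%02d:%02d event %d', int( $_ / 60 ) % 60, $_ % 60, $_
      } 1 .. 10_000;

      cmpthese( -3, {
          ## Key the hash on the line itself
          plain => sub {
              my %seen;
              $seen{ $_ } = 1 for @lines;
          },
          ## Key the hash on the MD5 digest of the line; note that Perl
          ## still hashes the 32-char hex digest internally
          md5 => sub {
              my %seen;
              $seen{ md5_hex( $_ ) } = 1 for @lines;
          },
      } );

      The md5 case necessarily does more work per line: the digest is computed, and then the 32-character digest string is itself fed through Perl's own hash function. And if you keyed on digests, detecting the (rare but real) collisions mentioned above would mean storing the original line against each digest and comparing it on every hit.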

      Does that answer your question?

