thezip has asked for the wisdom of the Perl Monks concerning the following question:
Hello all,
I have a log parsing problem, and I seek suggestions as to a reasonable Perlish solution -- I'm not really looking for any code per se, just algorithmic "advice".
First, I'll address the things that are given and cannot be changed.
Each day, I collect a dump from a logfile generator, which is the accumulation of all log entries since the beginning of that month. Each day, a new file is collected, and is theoretically at least as big as the previous day's file. I do not have the ability to directly control this "logfile source", so I must deal with the cumulative nature of the resulting files.
Occasionally, through magic processes that I also have no control over, there may be a purging of the "logfile source", which causes the next day's cumulative file to restart from 0 bytes and then contain only what was collected after the purge.
My program must "reconstruct" all of the unique log entries for the given month for a given server.
Assumptions:
In summary, there will be around 310 files, each having size somewhat over 1.2 MB -- nothing major. Each server will have its logs unique-ified into its own file.
Certainly, in Unix, I could do something like:
1) Concatenate the files into a single file
2) Then do: `sort -u <concatfile> > <sortedfile>`
... but I suspect this will eventually live on a Windows box.
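A pure-Perl stand-in for that pipeline would sidestep the portability question. Something along these lines is what I'm picturing (untested sketch; the daily dump files for one server/month are just passed as arguments, and the filenames are only illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cross-platform equivalent of `cat *.log | sort -u`:
# read every daily dump, remember each distinct line once,
# then print the unique lines in sorted order.
my %seen;
while ( my $line = <> ) {    # <> iterates over all files in @ARGV
    $seen{$line} = 1;
}
print sort keys %seen;
```

Run as, say, `perl dedupe.pl server1_day*.log > server1_unique.log`. Because the later cumulative files are supersets of the earlier ones (except after a purge), the hash simply collapses all the repetition, and a purge just means some entries arrive from the earlier files only.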
I thought that maybe I could compute an MD5 digest for each log entry and use that as a hash key for subsequent collision checks (i.e., ignore all subsequent redundancies).
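Roughly like this (untested sketch, using Digest::MD5; output order is first-seen rather than sorted):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Key the "seen" hash on the 16-byte binary MD5 of each entry instead
# of the entry itself, so memory per unique entry stays small no matter
# how long the log lines are.
my %seen;
while ( my $line = <> ) {
    my $key = md5($line);     # 16-byte binary digest of the raw line
    next if $seen{$key}++;    # already emitted this entry, skip it
    print $line;              # first occurrence: pass it through
}
```

At ~370 MB of total input (most of it duplicated), keying on the lines themselves would probably also fit in memory, so the MD5 step mainly trades a little CPU for a smaller hash; collisions on MD5 keys are vanishingly unlikely at this scale.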
Thoughts?