Jenda has asked for the wisdom of the Perl Monks concerning the following question:

G'day folks,
I mail myself the log files from several (home-made) services running on several computers. I'd like to be able to parse them and get results like

	"Everything looks normal"
	"1 error found"
	"Found something strange"

I do not really need to count anything; all I need to know is whether there's something in the logs I have to look at.

The logs look something like this:

Started some action at ...
	A subtask with some options
	Some more options
		whatever
			job id x
			job id y
		something silly
		done
	Another subtask
	with some options
		some nonsense
			job id x
			...
		done
Action succeeded plus some more info at ....

If a service encounters an error and handles it properly, it prints something like

	ERROR: ....
	The action failed at ...
and goes on to the next task, but I do not want to just search for "ERROR:". I'd like to catch all "unexpected" text.

Does anyone have any neat idea how to implement this (taking into account that the expected messages change as we implement additional features!)? Any pointers, suggested modules or articles? Any examples?

I don't need you to write this for me, I'm just interested in ideas.

Thanks, Jenda

== Jenda@Krynicky.cz == http://Jenda.Krynicky.cz ==
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
      -- Rick Osborne, osborne@gateway.grumman.com

Replies are listed 'Best First'.
Re: Multiline log parsing
by Aristotle (Chancellor) on Sep 24, 2002 at 14:06 UTC
    Well, if you don't know what error messages to expect, you'll have to write patterns for the known good messages and check if any lines fail to validate as known good. They must be bad then. Of course you'll probably get a couple false alarms, but you can watch those and tune your patterns.
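    A minimal sketch of this whitelist idea (the patterns below are guesses based on the sample log in the question, not the real formats):

```perl
use strict;
use warnings;

# Whitelist sketch: patterns for "known good" lines, guessed from the
# sample log above; anything matching none of them gets flagged.
my @known_good = (
    qr/^Started some action at /,
    qr/^\t/,                    # any indented subtask/option/job line
    qr/^Action succeeded /,
    qr/^\s*$/,                  # blank lines
);

sub suspect_lines {
    my @lines = @_;
    return grep { my $l = $_; !grep { $l =~ $_ } @known_good } @lines;
}

my @bad = suspect_lines(
    "Started some action at 10:00",
    "\tA subtask with some options",
    "ERROR: disk full",
    "Action succeeded plus some more info at 10:05",
);
print @bad ? scalar(@bad) . " unexpected line(s) found\n"
           : "Everything looks normal\n";    # prints "1 unexpected line(s) found"
```

    Tuning then just means adding patterns to @known_good whenever a false alarm shows up.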

    Makeshifts last the longest.

Re: Multiline log parsing
by rje (Deacon) on Sep 24, 2002 at 14:26 UTC
    If you have control over the creation of the multiline logs, maybe you can help things out by always terminating log records with multiple newlines (or a meaningful token); then, by setting the input record separator ($/) to "\n\n" (or that token), you can easily grapple with one record at a time when you read the logfiles in...
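    A sketch of that record-at-a-time reading, assuming the logger is changed to end each action's record with a blank line (an in-memory string stands in for the real logfile):

```perl
use strict;
use warnings;

# Hypothetical log where each record ends with an extra newline.
my $sample = "Started action A at 10:00\n\tsubtask\nAction succeeded at 10:01\n\n"
           . "Started action B at 10:02\nERROR: disk full\n\n";

open my $fh, '<', \$sample or die "open: $!";
{
    local $/ = "\n\n";               # input record separator: blank line
    while (my $record = <$fh>) {
        chomp $record;               # strips the trailing "\n\n"
        my @lines = split /\n/, $record;
        printf "record with %d line(s)\n", scalar @lines;
    }
}
# prints "record with 3 line(s)" then "record with 2 line(s)"
```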

    At any rate, your script will probably have to know what keywords to look for, perhaps read from a config file. The config file might just be a plain list of words, or a list of patterns (or a list of substitutions!), or even another perl module which maps keywords to function calls, closures, state changes, what-have-you. Might get a little messy tho.

    But if your script is just flagging down things for you to manually check out, then a simple hashtable of keywords and their resulting warning text might do.

      I'm afraid the regexp that'd match the whole file would be too long and messy. And testing each line separately for whether it's something unexpected is not good enough; I need to test the lines in context.

      But I like this config file with patterns and state changes idea. I think I'll use something like

      ( START => { '^FileCreate \d+\.\d+\.\d+$' => 'START',
                   '^---- Ticking: \d{4}/\d\d/\d\d \d\d:\d\d:\d\d - \d\d:\d\d:\d\d$' => 'START',
                   '^Creating files for site ' => 'FILES' },
        FILES => { '^\tCreating file ' => 'FILE',
                   '^File generation succeeded for site ' => 'START',
                   '^File generation failed for site ' => '--ERROR--',
                   '^Jobs for site \d+ with parameter type "\w+" are to be processed by HTTPPost or something.' => 'FILES',
                   '^Site \d+ has posting parameters either only for single or for package jobs!!!' => 'FILES',
                 },
        ... )
      or maybe
      ( START => [ qr'^FileCreate \d+\.\d+\.\d+$' => 'START',
                   qr'^---- Ticking: \d{4}/\d\d/\d\d \d\d:\d\d:\d\d - \d\d:\d\d:\d\d$' => 'START',
                   qr'^Creating files for site ' => 'FILES' ],
        FILES => [ qr'^\tCreating file ' => 'FILE',
                   qr'^File generation succeeded for site ' => 'START',
                   qr'^File generation failed for site ' => '--ERROR--',
                   qr'^Jobs for site \d+ with parameter type "\w+" are to be processed by HTTPPost or something.' => 'FILES',
                   qr'^Site \d+ has posting parameters either only for single or for package jobs!!!' => 'FILES',
                 ],
        ... )
      and read it with do() or eval().

      The second has two advantages. The regexps will be precompiled and they will be tested in a dependable order. But the code will look a little awkward.
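      A minimal driver for the second (array-based) form might look like this; the states and patterns below are a cut-down stand-in for the real config that would be read via do(), not the actual file formats:

```perl
use strict;
use warnings;

# Cut-down stand-in for the state-table config described above.
my %machine = (
    START => [
        qr/^Creating files for site /            => 'FILES',
    ],
    FILES => [
        qr/^\tCreating file /                    => 'FILE',
        qr/^File generation succeeded for site / => 'START',
        qr/^File generation failed for site /    => '--ERROR--',
    ],
    FILE  => [
        qr/^\t/                                  => 'FILE',
        qr/^File generation succeeded for site / => 'START',
    ],
);

# Walk the lines through the machine; unmatched lines and '--ERROR--'
# transitions are collected as problems to report.
sub scan {
    my ($machine, @lines) = @_;
    my $state = 'START';
    my @errors;
    LINE: for my $line (@lines) {
        my @pairs = @{ $machine->{$state} };
        while (my ($re, $next) = splice @pairs, 0, 2) {
            next unless $line =~ $re;
            if ($next eq '--ERROR--') { push @errors, $line }
            else                      { $state = $next }
            next LINE;
        }
        push @errors, "no rule in state $state matched: $line";
    }
    return @errors;
}

my @log = (
    "Creating files for site 42",
    "\tCreating file foo.txt",
    "File generation succeeded for site 42",
);
my @bad = scan(\%machine, @log);
print @bad ? scalar(@bad) . " problem(s) found\n"
           : "Everything looks normal\n";    # prints "Everything looks normal"
```

      Because the pairs sit in an array, they are tried strictly top to bottom, which is the "dependable order" advantage mentioned above.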

      Thanks for your ideas, Jenda

Re: Multiline log parsing
by sauoq (Abbot) on Sep 24, 2002 at 20:39 UTC

    If I were doing this, I'd write a state machine to help me parse the file. I'd use regexen to match the relevant lines and extract information out of them. Based on the limited info you provided, I'd use states like "StartAction", "StartSubtask", "SubtaskOptions", "Job", "EndSubtask", "EndAction", etc.

    Unless you set out to create a parseable grammar, you might run into problems, but since you have control over the grammar as well as the parser, you should be able to fiddle with things and make them work.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Multiline log parsing
by VSarkiss (Monsignor) on Sep 24, 2002 at 15:16 UTC

    Well, you've got to be looking for something. The fact that the log message has multiple lines is immaterial. Given that the files are small enough to mail yourself, slurp the entire file and do a pattern match against the whole thing at once.

    Relevant areas to look up: perlvar for $/, which, when undefined, will cause the diamond operator to read an entire file into a scalar; and perlre for the /m and /s modifiers, which control how ^, $, and . behave around newlines.
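    A sketch of that slurp-and-match approach (the "success" pattern is a hypothetical check invented for illustration; an in-memory string stands in for the real logfile):

```perl
use strict;
use warnings;

# Stand-in for a mailed logfile.
my $sample = <<'LOG';
Started some action at 10:00
	A subtask with some options
Action succeeded plus some more info at 10:05
LOG

open my $fh, '<', \$sample or die "open: $!";
my $log = do { local $/; <$fh> };        # undef $/ => slurp everything

# /m re-anchors ^ at every line start; /s lets .*? cross newlines,
# so one match can span the whole multiline record.
if ($log =~ /^Started some action .*?^Action succeeded /ms) {
    print "Everything looks normal\n";
} else {
    print "Found something strange\n";
}
# prints "Everything looks normal"
```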