adaptive syslog message parsing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

i have a syslog server, which gets almost 80 msgs per second from all our systems, so i'm trying to implement a solution for making the messages easier to digest, since it sends an email daily with all the day's messages, and it's now grown to several megs.. who wants to read every line?

my plan was to have my perl script generate an html page of collapsible lists.. something like

- host
  - process
    - error type (count)

but since the messages come from a variety of sources/daemons, in a variety of formats, i'm kind of stumped on how to approach parsing it..

if the format was always the same, i'd have no problem.. even if there were only a certain number of formats, i could time-consumingly come up with regexes to parse them all, but this is not practical given the exact format is not known..

so the part of writing the script i'm stuck on is implementing an adaptive parsing algorithm, to build a nested hashref "tree"..

i even thought about using a complex set of substr and index calls, but my head hurts after a while trying to figure out how to make it adaptive.

here's a sample of syslog messages..

mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
mail2-out - postfix/smtp: warning: valid_hostname: empty hostname
mail2-out - postfix/smtp: warning: malformed domain name in resource data of MX record for hotmil.com:
mail2-out - ntpd: sendto(192.168.4.20): Bad file descriptor
mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
mail2-out - postfix/smtp[32282]: warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
mail2-out - ntpd: sendto(192.168.4.20): Bad file descriptor
mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
infocache02 - ldap_cachemgr: libsldap: Status: 91  Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
infocache02 - ldap_cachemgr: Error: Unable to refresh from profile:tls_automount_profile. (error=1)
infocache02 - sendmail: l560aB7V017120: Losing ./qfl560aB7V017120: savemail panic
infocache02 - sendmail: l560aB7V017120: SYSERR(root): savemail: cannot save rejected email anywhere
infocache02 - sendmail: l1FM2rFa026352: Losing ./qfl1FM2rFa026352: savemail panic
infocache02 - sendmail: l1FM2rFa026352: SYSERR(root): savemail: cannot save rejected email anywhere
infocache02 - sendmail: l1FI2rFa022597: Losing ./qfl1FI2rFa022597: savemail panic
mail2-in - postfix/smtpd: warning: 190.55.102.166: hostname cpe-190-55-102-166.telecentro.com.ar verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 201.29.80.154: hostname 20129080154.user.veloxzone.com.br verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
mail2-in - postfix/smtpd: warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
mail2-in - postfix/smtpd: warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known

and here is how i would want it parsed (imagine Data::Dumper except with indentations).. once parsed, i could figure out how to make output like this:

infocache02
	ldap_cachemgr
		(1) libsldap: Status: 91  Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
		(1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
	sendmail
		(3) **************: Losing ************** savemail panic
		(2) SYSERR(root): savemail: cannot save rejected email anywhere
mail2-out
	ntpd
		(5) sendto(************): Bad file descriptor
	postfix/smtp
		(1) warning: valid_hostname: empty hostname
		(1) warning: malformed domain name in resource data of MX record for ********:
		(1) warning: numeric domain name in resource data of MX record for *******: ********
mail2-in
	postfix/smtpd
		(4) warning: ********: hostname ********** verification failed: hostname nor servname provided, or not known
		(2) warning: ********: address not listed for hostname **********

basically what i could use a hand with is figuring out how to have the code determine which part of each line is variable (with multiple variations possible per line).. it would rely on analyzing all the other available messages and keeping track of which part of the line varied (with respect to similar lines).. i seem to remember perl has fuzzy regex matching, but i don't know what kind of regex would be fuzzy enough heh.. i've been writing perl code for a few years now, but i have never encountered this kind of problem before and would appreciate any help and pointers in the right direction..

Re: adaptive syslog message parsing by BrowserUk (Patriarch) on Jun 06, 2007 at 23:30 UTC
This makes several assumptions based solely upon the sample data provided:

Updated: Simplified 1 regex and improved another.

    #! perl -slw
    use strict;

    my %log;
    while( <DATA> ) {
        my( $src, $mode, $rest ) = m'
            ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
        'x;

        ## $rest =~ s[ (?: \S+ \. ){1,4} \S+ ][**]gx;
        $rest =~ s[ (?: [\w-]+ \. ){1,4} [\w-]+ ][*]gx;
        ## $rest =~ s[ [a-z] (?= [^:] [A-Z] [^:\s]+ \d ) [^:\s]+
        $rest =~ s[ [a-z] \w+ \d : ][**]gx;

        ++$log{ $src }{ $mode }{$rest};
    }

    for my $src ( sort keys %log ) {
        print $src;
        for my $mode ( sort keys %{ $log{ $src } } ) {
            print "\t$mode";
            print "\t\t($log{ $src}{ $mode }{ $_ }) $_"
                for sort keys %{ $log{ $src}{ $mode } };
        }
    }
    __DATA__
    your sample data

Produces (after update):

    C:\test>junk5
    infocache02
        ldap_cachemgr
            (1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
            (1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
        sendmail
            (3) Losing ./ savemail panic
            (2) SYSERR(root): savemail: cannot save rejected email anywhere
    mail2-in
        postfix/smtpd
            (2) warning: : address not listed for
            (4) warning: : hostname verification failed: hostname nor servname provided, or not known
    mail2-out
        ntpd
            (5) sendto(): Bad file descriptor
        postfix/smtp
            (1) warning: malformed domain name in resource data of MX record for :
            (1) warning: numeric domain name in resource data of MX record for : **
            (1) warning: valid_hostname: empty hostname

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: adaptive syslog message parsing by Anonymous Monk on Jun 07, 2007 at 17:25 UTC
this is VERY promising and is doing almost exactly what i need, although i think it could be tweaked a little more to generalize the data (it outputs decent results on my more complete set of data, only it has many duplicates where messages are very similar).. it also has no problem with the fact that i'm normally using fqdns instead of hostnames (i only removed the fqdns for brevity's sake), which is a testament to how resilient the regex is.. i need to study exactly how you did what you did so far.. any further help is most welcome too
Re^3: adaptive syslog message parsing by BrowserUk (Patriarch) on Jun 07, 2007 at 22:33 UTC
This is about as far as I think I would go. The remaining duplicates are where the user appears to supply random junk in place of commands or addresses. You could certainly add a few more special cases if they are frequent.

The modified code:

    while( <> ) {
        next if /^\s*$/;    ## Skip blank lines
        my( $src, $mode, $rest ) = m'
            ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
        'x;

        if( $rest =~ m[warning: (?=.*Illegal address syntax)] ) {
            ++$log{ $src }{ $mode }{ 'warning: Illegal address syntax from * in MAIL command: *' };
            next;
        }
        if( $rest =~ m[warning: (?=.*non-SMTP command)] ) {
            ++$log{ $src }{ $mode }{ 'warning: ** non-SMTP command from ' };
            next;
        }
        $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+][]gx;    ## Remove fqdns
        $rest =~ s[ [a-z] \w+ \d : ][]gx;               ## Server names?
        $rest =~ s[ [A-Z0-9]{11} : ][]x;                ## Queue names
        $rest =~ s[ < [^>]+ > ][**]x;                   ## Common form of bad name
        ++$log{ $src }{ $mode }{ $rest };
    }

The results from the larger datasets: Read more... (5 kB)
Re: adaptive syslog message parsing by GrandFather (Saint) on Jun 07, 2007 at 00:51 UTC
Here's a start:

    use strict;
    use warnings;

    my $digest = bless {root => {}, maxLevel => 3};

    $digest->add ($_) while <DATA>;
    $digest->mergeTails ();
    $digest->print ();

    sub add {
        my ($self, $line, $level, $context) = @_;

        $level ||= 1;
        $context ||= $self->{root};

        if ($level == $self->{maxLevel} or $line !~ s/(\S*?)\s*\W\s+//) {
            push @{$context->{tails}}, $line;
            return;
        }

        my $prefix = $1;
        $context->{$prefix} ||= {};
        $context = $context->{$prefix};
        $self->add ($line, 1 + $level, $context);
    }

    sub mergeTails {
        my ($self, $context) = @_;

        $context ||= $self->{root};

        unless (exists $context->{tails}) {
            $self->mergeTails ($context->{$_}) for keys %$context;
            return;
        }

        my @tails = sort {length $a <=> length $b} @{$context->{tails}};
        my @groups;

        push @{$groups[length $_]}, $_ for @tails;
        @groups = grep {defined $_} @groups;

        for my $group (@groups) {
            my $mask = pop @$group;
            my $count = 1;

            while (@$group) {
                my $str = pop @$group;
                my $mix = $mask ^ $str;
                my $cpl = "\xff" x length $mix;

                $mix =~ tr/\0/\xff/c;
                $mix = $mix ^ $cpl;
                $mask = $mask & $mix;
                ++$count;
            }

            $mask =~ tr/\0/*/;
            push @{$context->{digest}}, [$mask, $count];
        }
    }

    sub print {
        my ($self, $context, $indent) = @_;

        $context ||= $self->{root};
        $indent ||= '';

        if (exists $context->{digest}) {
            print "$indent($_->[1]) $_->[0]" for @{$context->{digest}};
            return;
        }

        for (sort keys %$context) {
            print "$indent$_\n";
            $self->print ($context->{$_}, $indent . '    ');
        }
    }

    __DATA__

Read more... data per OP (3 kB)

Prints:

    infocache02
        ldap_cachemgr
            (1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
            (1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
        sendmail
            (3) l*******0*****: Losing ./qfl*******0*****: savemail panic
            (2) l*******0*****: SYSERR(root): savemail: cannot save rejected email anywhere
    mail2-in
        postfix/smtpd
            (2) warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
            (2) warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known
            (1) warning: 201.29.80.154: hostname 20129080154.user.veloxzone.com.br verification failed: hostname nor servname provided, or not known
            (1) warning: 190.55.102.166: hostname cpe-190-55-102-166.telecentro.com.ar verification failed: hostname nor servname provided, or not known
    mail2-out
        ntpd
            (5) sendto(192.168.4.*0): Bad file descriptor
        postfix/smtp
            (1) warning: valid_hostname: empty hostname
            (1) warning: malformed domain name in resource data of MX record for hotmil.com:
        postfix/smtp[32282]
            (1) warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2

which doesn't quite digest the tails as you would like, but you could get a lot closer by dealing with matching runs of 'words' (`/(\S+)/`) rather than runs of characters. Algorithm::Diff would facilitate matching runs (left as an exercise for the reader).

DWIM is Perl's answer to Gödel
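The character-level masking in `mergeTails` leans on Perl's string-wise `^` and `&` operators. Below is a minimal standalone sketch of just that step, assuming all inputs have the same length. The helper name `char_mask` is invented for illustration and is not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse equal-length strings into one template,
# replacing every character position that differs with '*'.
sub char_mask {
    my ($mask, @others) = @_;
    for my $str (@others) {
        die "strings must be the same length\n"
            unless length($str) == length($mask);
        my $diff = $mask ^ $str;    # "\0" wherever the characters agree
        $diff =~ tr/\0/\xff/c;      # every differing position becomes "\xff"
        my $keep = $diff ^ ("\xff" x length $diff);    # invert: "\xff" = agree
        $mask &= $keep;             # differing positions drop to "\0"
    }
    $mask =~ tr/\0/*/;              # make the variable positions visible
    return $mask;
}

print char_mask('sendto(192.168.4.10)', 'sendto(192.168.4.20)'), "\n";
```

One caveat, as far as I know: under `use feature 'bitwise'` (which `use v5.28` or later turns on) `^` and `&` always act numerically, so the string forms `^.` and `&.` would be needed there.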
Re^2: adaptive syslog message parsing by GrandFather (Saint) on Jun 07, 2007 at 01:31 UTC
Ok, I couldn't resist! Add:

    use Algorithm::Diff;

toward the start. In `sub add` change:

    push @{$context->{tails}}, $line;

to:

    push @{$context->{tails}}, [$line =~ /(\S+)/g];

In `sub mergeTails` replace everything after:

    my @groups;

with:

    push @{$groups[@$_]}, $_ for @tails;
    @groups = grep {defined $_} @groups;

    for my $group (@groups) {
        my @ref = @{$group->[-1]};
        my @org = @ref;
        my $count = 1;

        pop @$group;
        while (@$group) {
            my @new = @{pop @$group};
            my @diffs = Algorithm::Diff::diff (\@ref, \@new);

            for my $change (@diffs) {
                next unless $change->[0][0] eq '-';
                $ref[$change->[0][1]] = undef;
            }
            ++$count;
        }

        for (0 .. $#ref) {
            next if defined $ref[$_];
            $org[$_] = '***';
        }

        push @{$context->{digest}}, [join (' ', @org), $count];
    }

Now prints:

    infocache02
        ldap_cachemgr
            (1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
            (1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
        sendmail
            (3) *** Losing *** savemail panic
            (2) *** SYSERR(root): savemail: cannot save rejected email anywhere
    mail2-in
        postfix/smtpd
            (2) warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
            (4) warning: *** hostname *** verification failed: hostname nor servname provided, or not known
    mail2-out
        ntpd
            (5) *** Bad file descriptor
        postfix/smtp
            (1) warning: valid_hostname: empty hostname
            (1) warning: malformed domain name in resource data of MX record for hotmil.com:
        postfix/smtp[32282]
            (1) warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
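For comparison, here is a core-Perl sketch of the same word-level idea without Algorithm::Diff. It is not GrandFather's code: the helper `fold_messages` and its similarity threshold are assumptions added here, the threshold being one possible way to keep unrelated messages with the same word count from collapsing into a single all-star line.

```perl
use strict;
use warnings;

# Hypothetical helper: fold messages into templates word by word. A message
# joins a template when both have the same number of words and at least
# $min_same (an assumed ratio, e.g. 0.5) of the words match in place.
sub fold_messages {
    my ($min_same, @messages) = @_;
    my @templates;    # each entry: { words => [...], count => N }

  MESSAGE:
    for my $msg (@messages) {
        my @words = split ' ', $msg;
        for my $t (@templates) {
            my $tw = $t->{words};
            next unless @$tw == @words;
            my $same = grep { $tw->[$_] eq $words[$_] } 0 .. $#words;
            next if $same < $min_same * @words;

            # Star out every position where this message disagrees.
            for my $i (0 .. $#words) {
                $tw->[$i] = '***' if $tw->[$i] ne $words[$i];
            }
            ++$t->{count};
            next MESSAGE;
        }
        push @templates, { words => \@words, count => 1 };
    }
    return map { [ join(' ', @{ $_->{words} }), $_->{count} ] } @templates;
}

my @digest = fold_messages(
    0.5,
    'l560aB7V017120: Losing ./qfl560aB7V017120: savemail panic',
    'l1FM2rFa026352: Losing ./qfl1FM2rFa026352: savemail panic',
    'l1FI2rFa022597: Losing ./qfl1FI2rFa022597: savemail panic',
    'l560aB7V017120: SYSERR(root): savemail: cannot save rejected email anywhere',
);
print "($_->[1]) $_->[0]\n" for @digest;
```

Because words are compared strictly by position, a message with an extra or missing word starts a new template. That is cruder than a real diff, but it has no dependencies and the threshold is easy to tune against a larger data set.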
Re^3: adaptive syslog message parsing by Anonymous Monk on Jun 07, 2007 at 17:20 UTC
i admit, i lol'd when i read 'i couldn't resist..' i couldn't duplicate the output with the sample data using Algorithm::Diff; it was similar but the newlines were off.. additionally, on my more complete set of data it outputs nothing but one line (with the number 6 in parentheses).. it looks pretty promising on the short set of sample data but i think it gets confused by the big set of data (which happens to use fqdns instead of just hostnames)
Re^4: adaptive syslog message parsing by GrandFather (Saint) on Jun 07, 2007 at 19:54 UTC
Re: adaptive syslog message parsing by AK108 (Friar) on Jun 06, 2007 at 23:13 UTC
You'll probably want to use a regular expression to parse the log entry. Read more... (4 kB) The line `my($system, $subsystem, $message) = /(\S+)\s*\-\s*([^:]+):\s*(.+)/;` takes advantage of $_. It matches a run of non-space characters (`(\S+)`, which becomes $system), any number of spaces (`\s*`), a dash (`\-`), any number of spaces (`\s*`), anything up to the first colon (`([^:]+)`, which becomes $subsystem), the colon (`:`), any number of spaces (`\s*`), and finally the rest of the string (`(.+)`, which becomes $message). This results in the following: Read more... (4 kB)
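A small runnable sketch of that split, feeding a hash of hashes of hashes that holds the counts. This is not the elided code from the post: the helper name `parse_line`, the `^` anchor, and requiring whitespace on both sides of the dash are assumptions made here.

```perl
use strict;
use warnings;

# Split a line into (system, subsystem, message); returns an empty list when
# the line does not fit. The ^ anchor and \s+ around the dash are assumptions.
sub parse_line {
    my ($line) = @_;
    return $line =~ /^(\S+)\s+-\s+([^:]+):\s*(.+)/;
}

my %log;    # $log{system}{subsystem}{message} = count
for my $line (
    'mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor',
    'mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor',
    'mail2-out - postfix/smtp: warning: valid_hostname: empty hostname',
    'not a syslog line',
) {
    my ($system, $subsystem, $message) = parse_line($line)
        or next;    # skip lines that do not fit the pattern
    ++$log{$system}{$subsystem}{$message};
}

for my $system (sort keys %log) {
    print "$system\n";
    for my $subsystem (sort keys %{ $log{$system} }) {
        print "\t$subsystem\n";
        my $messages = $log{$system}{$subsystem};
        print "\t\t($messages->{$_}) $_\n" for sort keys %$messages;
    }
}
```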
Re^2: adaptive syslog message parsing by Anonymous Monk on Jun 06, 2007 at 23:33 UTC
i probably do want to use a regular expression.. :) unfortunately, that wasn't the part of the project i was having difficulty with.. where i'm stumped is trying to aggregate all messages of the same type (which requires the code to figure out which parts of the line are varying).. that part you haven't touched upon in your reply (but you did save me the work of writing the hash-building loop, albeit the easy part).. thanks so far
Re^3: adaptive syslog message parsing by AK108 (Friar) on Jun 06, 2007 at 23:56 UTC
Yes, I used an HoHoH. The first dimension is the system, the second is the program (what I called the subsystem), and the third dimension is the message. The value of the third dimension gets the counts. Were you wanting something different than that? An array containing hashrefs might be an option, as it would preserve the initial order. Alternatively, you could use a hash keyed by message string, with each value an array whose items each represent one instance of that message. There are lots of ways to implement this, data structure-wise, and it's usually easy to transform between them. Also, you might be interested in trying to parse out any dates, times, computer names, IP addresses, or anything else relevant to build a context for each message. You can also throw a `while(1) {...}` loop around it to continually read from the file (once you add the code to open/read/close, that is).
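The order-preserving option (an array of hashrefs) can be paired with a lookup hash so that counting stays cheap. A small sketch: `record`, `@order` and `%seen` are names invented here, and joining the key on "\0" assumes no field contains a NUL byte.

```perl
use strict;
use warnings;

my @order;    # hashrefs in first-seen order
my %seen;     # composite key => the same hashref, for quick lookup

# Count one occurrence and return the running total for that message.
sub record {
    my ($system, $program, $message) = @_;
    my $key   = join "\0", $system, $program, $message;
    my $entry = $seen{$key};
    unless ($entry) {
        $entry = $seen{$key} = {
            system  => $system,
            program => $program,
            message => $message,
            count   => 0,
        };
        push @order, $entry;
    }
    return ++$entry->{count};
}

record('mail2-out',   'ntpd',     'sendto(*): Bad file descriptor');
record('infocache02', 'sendmail', 'SYSERR(root): savemail: cannot save rejected email anywhere');
record('mail2-out',   'ntpd',     'sendto(*): Bad file descriptor');

print "$_->{system} $_->{program} ($_->{count}) $_->{message}\n" for @order;
```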
Re: adaptive syslog message parsing by thezip (Vicar) on Jun 07, 2007 at 00:15 UTC
Given that you could have many types of logged messages, with each having its own format specification, you might apply a regex to determine its type. Once an entry is classified, send it to an appropriate handler (subroutine) that knows how to parse that type of entry into its component parts, and then stuff the guts you care about into an appropriate data structure. The structure might look like:

    a hash of servers
        a hash of daemon names
            a hash of messages (and their cumulative frequencies)

For entries that do not classify to a handler you have coded, send these to an exception report (log file), and create the necessary handlers later as needed.

Where do you want *them* to go today?
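One way the classify-then-dispatch idea could look in code. This is a sketch only: the two handlers and their patterns are guesses based on the sample messages in this thread, and `classify`, `@handlers` and `@unclassified` are illustrative names.

```perl
use strict;
use warnings;

# Illustrative handlers only; a real rule set would grow as new
# unclassified messages show up in the exception report.
my @handlers = (
    {
        name   => 'ntpd sendto failure',
        match  => qr/^sendto\((\S+?)\): (.+)$/,
        handle => sub { my ($addr, $error) = @_; "sendto(*): $error" },
    },
    {
        name   => 'postfix hostname verification',
        match  => qr/^warning: (\S+): hostname (\S+) verification failed: (.+)$/,
        handle => sub {
            my ($ip, $host, $reason) = @_;
            "warning: *: hostname * verification failed: $reason";
        },
    },
);

my @unclassified;    # stands in for the exception report

# Return the generalized message, or nothing if no handler claims it.
sub classify {
    my ($message) = @_;
    for my $h (@handlers) {
        if (my @parts = $message =~ $h->{match}) {
            return $h->{handle}->(@parts);
        }
    }
    push @unclassified, $message;
    return;
}

print classify('sendto(192.168.4.10): Bad file descriptor'), "\n";
```

Each handler receives the captured parts, so it could just as well store the address or hostname somewhere instead of discarding it.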
Re^2: adaptive syslog message parsing by neosamuri (Friar) on Jun 07, 2007 at 04:47 UTC
It seems to me the cumulative frequencies of message types (for each server/daemon name combination) would be less useful than the sequence of events.

    hash{servers}
        hash{daemon names}
            array[$time, message]

or

    $struct->{$server}->{$daemon}->[$message_id] = [$time,$message];

Though for either method it would be nice to send a GD generated graph for each server/daemon pair, which can give quick access to needed information.
Re^3: adaptive syslog message parsing by thezip (Vicar) on Jun 07, 2007 at 07:12 UTC
Yes, these are nice enhancements, but they are not to the OP's specification, which detailed that the statistical frequencies (i.e. counts) be stored in the lowest hashes. I cannot comment as to which is better since I don't have the bigger picture of the problem.
Re: adaptive syslog message parsing by otto (Beadle) on Jun 07, 2007 at 07:28 UTC
Maybe this doesn't exactly answer the question, but it should maybe be asked... Personally I love writing code, but I'm much more efficient if I can beg, borrow, or well I guess I won't steal but you get the drift... First, since this is "from a variety of sources/daemons, in a variety of formats", I'd consider configuring the syslog to dump different sources of logs to different files. Then you limit the parsing issues to a particular topic... you should not have to deal with parsing mail log messages and dhcp messages in the same file. Second, you note the volume and "who wants to read every line". I would suggest you consider something like rrdtool. It has many parsers for various kinds of log files and it makes pretty graphs. Even if you choose not to use rrdtool, you can grab the parsers, many of which are perl, and look at them for parsing each of the various formats in which you are interested... (Caveat... I have not used rrdtool yet, but am planning on it.) ..Otto
Re^2: adaptive syslog message parsing by Anonymous Monk on Jun 07, 2007 at 17:53 UTC
thanks for the ideas otto.. unfortunately i have neither the access nor the authority to start modifying the syslog server itself since it's in production.. i'll have the server's architect take your suggestion into consideration, but it may cause regressions in other systems that rely on the way it's structured now (albeit i'll agree it isn't very well designed)... as far as rrdtool, i've used it a bit.. it definitely makes nice graphs, but it's more of a numerical-analysis application than a text-parsing one.. it's, at its core, a backend db for storing data in a fixed amount of space, along with extensions to create graphs from the db you populated with (usually) numerical measurements, counters, and the like..
Re: adaptive syslog message parsing by ryanc (Monk) on Jun 07, 2007 at 18:11 UTC
what about tenshi? it is written in Perl and does almost exactly what you want, aside from the HTML page. the output it sends you via email is very similar to your desired HTML output.
Re^2: adaptive syslog message parsing by Anonymous Monk on Jun 08, 2007 at 17:06 UTC
Thanks for the tip, not the poster, but installed it and after a few tweaks it's running fine. Useful tool. Thanks.
Re: adaptive syslog message parsing by tracton (Initiate) on Jun 08, 2007 at 21:00 UTC
The problem with fuzzy matching is that you will not be able to assign meaning without considerable manual labor after the analysis phase, and the analysis will probably break lines into pieces a human would not consider reasonable. Instead, protect your head and consider the perl package "logwatch", which summarizes log files via email. It comes with ~60 service-specific filters to parse many unix log files. Unrecognized lines are simply passed through un-summarized, and then you'll know what filters you need to write/update. It will at least get you started and may help you choose a solution path.
Re: adaptive syslog message parsing by telcontar (Beadle) on Jun 08, 2007 at 06:37 UTC
Have you considered using syslog-ng instead of syslog? You can do content filtering in quite flexible ways at a higher level, and you can create reusable configuration .. That way you can split which messages you really want to read into several files and throw away the rest. I'd imagine this approach would be easier to maintain than doing it with regexps.
Re^2: adaptive syslog message parsing by Anonymous Monk on Jun 08, 2007 at 15:57 UTC
we actually do use syslog-ng on most hosts, all those entries were the result of syslog-ng logging.. the main goal is having a centralized place to look at it for all messages, informative and otherwise, and a way to cut down on the false positives.. by performing this "smart-regex" with summarization/generalization, hopefully it will give a singular viewpoint for how to respond to the ones that are found to actually need response (how severe an error is however can't be calculated by any program, and needs sysadmin intervention).. the last part is what my script aims to solve