in reply to adaptive syslog message parsing

This makes several assumptions based solely upon the sample data provided:

Updated: Simplified 1 regex and improved another.

    #! perl -slw
    use strict;

    my %log;

    while( <DATA> ) {
        my( $src, $mode, $rest ) = m'
            ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
        'x;

        ## $rest =~ s[ (?: \S+ \. ){1,4} \S+ ][****]gx;
        $rest =~ s[ (?: [\w-]+ \. ){1,4} [\w-]+ ][****]gx;

        ## $rest =~ s[ [a-z] (?= [^:]* [A-Z] [^:\s]+ \d ) [^:\s]+
        $rest =~ s[ [a-z] \w+ \d : ][****]gx;

        ++$log{ $src }{ $mode }{$rest};
    }

    for my $src ( sort keys %log ) {
        print $src;
        for my $mode ( sort keys %{ $log{ $src } } ) {
            print "    $mode";
            print "        ($log{ $src}{ $mode }{ $_ }) $_"
                for sort keys %{ $log{ $src}{ $mode } };
        }
    }

    __DATA__
    your sample data

Produces (after update):

    C:\test>junk5
    infocache02
        ldap_cachemgr
            (1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
            (1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
        sendmail
            (3) **** Losing ./**** savemail panic
            (2) **** SYSERR(root): savemail: cannot save rejected email anywhere
    mail2-in
        postfix/smtpd
            (2) warning: ****: address not listed for ****
            (4) warning: ****: hostname **** verification failed: hostname nor servname provided, or not known
    mail2-out
        ntpd
            (5) sendto(****): Bad file descriptor
        postfix/smtp
            (1) warning: malformed domain name in resource data of MX record for ****:
            (1) warning: numeric domain name in resource data of MX record for ****: ****
            (1) warning: valid_hostname: empty hostname
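To see what the header regex and the first substitution do to one line in isolation, here is a minimal sketch. The input line is made up for illustration (it is not from the original sample data), and the variable `$key` is my own name; the two regexes are copied from the code above.

```perl
#! perl -w
use strict;

## Hypothetical input line, shaped like "host - program[pid]: message".
my $line = 'mail2-in - postfix/smtpd[1234]: warning: 10.1.2.3: '
         . 'address not listed for hostname mx.example.com';

## Header regex from the post: host, program name, optional [pid], message.
my( $src, $mode, $rest ) = $line =~ m'
    ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
'x;

## Mask dotted names and addresses so similar messages share one hash key.
( my $key = $rest ) =~ s[ (?: [\w-]+ \. ){1,4} [\w-]+ ][****]gx;

print "$src | $mode | $key\n";
```

Both the IP address and the hostname collapse to `****`, which is why messages that differ only in those fields end up counted under the same key.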

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: adaptive syslog message parsing
by Anonymous Monk on Jun 07, 2007 at 17:25 UTC
    this is VERY promising and is doing almost exactly what i need, although i think it could be tweaked a little more to generalize the data a little (it outputs decent results on my more complete set of data, only it has many duplicates where messages are very similar.. )

    it has also no problem with the fact that i'm normally using fqdn's instead of hosts (i only removed fqdn for brevity's sake), which is a testament to how resilient the regex is..

    i need to study exactly how you did what you did so far.. any further help is most welcome too

      This is about as far as I think I would go. The remaining duplicates are where the user appears to supply random junk in place of commands or addresses. You could certainly add a few more special cases if they are frequent.

      The modified code:

      while( <> ) {
          next if /^\s*$/;    ## Skip blank lines

          my( $src, $mode, $rest ) = m'
              ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
          'x;

          if( $rest =~ m[warning: (?=.*Illegal address syntax)] ) {
              ++$log{ $src }{ $mode }{
                  'warning: Illegal address syntax from **** in MAIL command: ****'
              };
              next;
          }

          if( $rest =~ m[warning: (?=.*non-SMTP command)] ) {
              ++$log{ $src }{ $mode }{
                  'warning: **** non-SMTP command from ****'
              };
              next;
          }

          $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+][****]gx;    ## Remove fqdns
          $rest =~ s[ [a-z] \w+ \d : ][****]gx;               ## Server names?
          $rest =~ s[ [A-Z0-9]{11} : ][****]x;                ## Queue names
          $rest =~ s[ < [^>]+ > ][****]x;                     ## Common form of bad name

          ++$log{ $src }{ $mode }{ $rest };
      }
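If more special cases turn up, one option is to keep them in a table instead of adding another `if` block each time. The following is only a sketch of that idea, not part of the code above: `@special` and `special_key` are hypothetical names, and the test message is invented.

```perl
#! perl -w
use strict;

## Hypothetical table: each entry pairs a pattern with the key to count under.
my @special = (
    [ qr/warning: .*Illegal address syntax/,
      'warning: Illegal address syntax from **** in MAIL command: ****' ],
    [ qr/warning: .*non-SMTP command/,
      'warning: **** non-SMTP command from ****' ],
);

## Return the canned key for the first matching pattern, or undef if none match.
sub special_key {
    my( $rest ) = @_;
    for my $case ( @special ) {
        my( $re, $key ) = @$case;
        return $key if $rest =~ $re;
    }
    return;
}

## Made-up message text, for illustration only.
my $key = special_key(
    'warning: Illegal address syntax from unknown[10.0.0.1] in MAIL command: <junk>'
);
print defined $key ? $key : '(no special case)', "\n";
```

Inside the loop, the two `if` blocks could then become a single check: if `special_key( $rest )` returns a defined value, count under that key and `next`; otherwise fall through to the substitutions.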

      The results from the larger datasets:

