in reply to Re: adaptive syslog message parsing
in thread adaptive syslog message parsing

this is VERY promising and is doing almost exactly what i need, although i think it could be tweaked a bit more to generalize the data (it gives decent results on my more complete data set, but there are many duplicates where messages are very similar..)

it also has no problem with the fact that i'm normally using fqdns instead of bare hostnames (i only removed the fqdns for brevity's sake), which is a testament to how resilient the regex is..

i need to study exactly how you did what you did so far.. any further help is most welcome too

Re^3: adaptive syslog message parsing
by BrowserUk (Patriarch) on Jun 07, 2007 at 22:33 UTC

    This is about as far as I think I would go. The remaining duplicates are where the user appears to supply random junk in place of commands or addresses. You could certainly add a few more special cases if they are frequent.

    The modified code:

    while( <> ) {
        next if /^\s*$/;    ## Skip blank lines

        ## Split "source - mode[pid]: message" into its three parts
        my( $src, $mode, $rest ) = m'
            ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
        'x;

        ## Special cases: collapse all variants under one fixed key
        if( $rest =~ m[warning: (?=.*Illegal address syntax)] ) {
            ++$log{ $src }{ $mode }{ 'warning: Illegal address syntax from **** in MAIL command: ****' };
            next;
        }
        if( $rest =~ m[warning: (?=.*non-SMTP command)] ) {
            ++$log{ $src }{ $mode }{ 'warning: **** non-SMTP command from ****' };
            next;
        }

        $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+][****]gx;   ## Remove fqdns
        $rest =~ s[ [a-z] \w+ \d : ][****]gx;              ## Server names?
        $rest =~ s[ [A-Z0-9]{11} : ][****]x;               ## Queue names
        $rest =~ s[ < [^>]+ > ][****]x;                    ## Common form of bad name

        ++$log{ $src }{ $mode }{ $rest };
    }
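If the special cases start to pile up, one option is to move them into a table of pattern/label pairs, so adding a case means adding a line instead of another `if` block. The following is only a sketch of that idea, not part of the posted code: the `@special` table, the `normalize` and `tally` helpers, and the sample log lines are all made up for illustration, and the real data may need different patterns.

```perl
use strict;
use warnings;

## Hypothetical table of special cases: [ pattern, canonical label ].
## The two entries mirror the special-case checks in the posted loop.
my @special = (
    [ qr/warning: (?=.*Illegal address syntax)/,
      'warning: Illegal address syntax from **** in MAIL command: ****' ],
    [ qr/warning: (?=.*non-SMTP command)/,
      'warning: **** non-SMTP command from ****' ],
);

## Reduce one message body to its generalized form.
sub normalize {
    my( $rest ) = @_;
    for my $case ( @special ) {
        return $case->[1] if $rest =~ $case->[0];
    }
    $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+ ][****]gx;   ## fqdns and dotted addresses
    $rest =~ s[ [a-z] \w+ \d : ][****]gx;               ## server names?
    $rest =~ s[ [A-Z0-9]{11} : ][****]x;                ## queue names
    $rest =~ s[ < [^>]+ > ][****]x;                     ## <bad name>
    return $rest;
}

## Count one raw line under source -> mode -> generalized message.
## Unlike the posted loop, lines that do not parse are skipped.
sub tally {
    my( $log, $line ) = @_;
    return if $line =~ /^\s*$/;
    my( $src, $mode, $rest ) = $line =~ m'
        ( ^ \S+ ) \s+ - \s+ ( [^\[:]+ ) (?: \[ \d+ \] )? : \s* ( .+ $ )
    'x or return;
    ++$log->{ $src }{ $mode }{ normalize( $rest ) };
}

## Invented sample lines in the "source - mode[pid]: message" layout.
my %log;
tally( \%log, $_ ) for
    'mail1 - postfix/smtpd[101]: warning: Illegal address syntax from unknown[10.0.0.1] in MAIL command: <a b>',
    'mail1 - postfix/smtpd[102]: warning: Illegal address syntax from unknown[10.0.0.2] in MAIL command: <c d>',
    'mail1 - postfix/smtpd[103]: connect from host1.example.com[10.0.0.3]',
    'mail1 - postfix/smtpd[104]: connect from host2.example.org[10.0.0.4]',
    'mail2 - postfix/smtpd[7]: warning: non-SMTP command from unknown[10.0.0.9]: GET / HTTP/1.0';

## Report: most frequent generalized message first within each source/mode.
for my $src ( sort keys %log ) {
    for my $mode ( sort keys %{ $log{ $src } } ) {
        my $msgs = $log{ $src }{ $mode };
        printf "%s %s %5d %s\n", $src, $mode, $msgs->{ $_ }, $_
            for sort { $msgs->{ $b } <=> $msgs->{ $a } || $a cmp $b } keys %$msgs;
    }
}
```

The table is checked in order and the first match wins, so more specific patterns should go before more general ones.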

    The results from the larger datasets:


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.