Re^2: adaptive syslog message parsing

this is VERY promising and does almost exactly what i need, although i think it could be tweaked to generalize the data a little more (it gives decent results on my more complete data set, but there are many duplicates where messages are very similar).

it also has no problem with the fact that i normally use FQDNs instead of bare hostnames (i only removed the FQDNs for brevity's sake), which is a testament to how resilient the regex is.

i need to study exactly how you did what you did so far. any further help is most welcome too.

Replies are listed 'Best First'.

Re^3: adaptive syslog message parsing
by BrowserUk (Patriarch) on Jun 07, 2007 at 22:33 UTC

This is about as far as I think I would go. The remaining duplicates are where the user appears to supply random junk in place of commands or addresses. You could certainly add a few more special cases if they are frequent.

The modified code:

my %log;    ## Counts, keyed by source, mode and generalized message

while( <> ) {
    next if /^\s*$/;        ## Skip blank lines

    ## Split the line into source, mode (optionally followed by a
    ## bracketed number) and the rest of the message.
    ## Skip lines that do not have that shape.
    my( $src, $mode, $rest ) =  m'
        ( ^ \S+ ) \s+ - \s+
        ( [^\[:]+ ) (?: \[ \d+ \] )? : \s*
        ( .+ $  )
    'x or next;

    ## Special cases: count under one fixed key, whatever junk was supplied
    if( $rest =~ m[warning: (?=.*Illegal address syntax)] ) {
        ++$log{ $src }{ $mode }{ 'warning: Illegal address syntax from **** in MAIL command: ****' };
        next;
    }
    if( $rest =~ m[warning: (?=.*non-SMTP command)] ) {
        ++$log{ $src }{ $mode }{ 'warning: **** non-SMTP command from ****' };
        next;
    }

    $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+][****]gx;    ## Remove fqdns
    $rest =~ s[ [a-z] \w+ \d : ][****]gx;               ## Server names?
    $rest =~ s[ [A-Z0-9]{11} : ][****]x;                ## Queue names
    $rest =~ s[ < [^>]+ > ][****]x;                     ## Common form of bad name

    ++$log{ $src }{ $mode }{ $rest };
}
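
If more special cases do turn out to be frequent, one way to keep them manageable is to move them into a table rather than adding another if block each time. What follows is only a sketch of that idea, not part of the code above: the names @special and normalize are made up for illustration, the two table entries simply mirror the two special cases already shown, and only two of the generic substitutions are repeated to keep it short.

```perl
use strict;
use warnings;

# Hypothetical table of special cases: each entry pairs a pattern with the
# fixed key to count under. Add entries as frequent near-duplicates turn up.
my @special = (
    [ qr/warning: .*Illegal address syntax/ => 'warning: Illegal address syntax from **** in MAIL command: ****' ],
    [ qr/warning: .*non-SMTP command/       => 'warning: **** non-SMTP command from ****' ],
);

# Reduce the free-text part of a log line to a generalized key.
sub normalize {
    my( $rest ) = @_;

    # The first matching special case wins and replaces the whole message.
    for my $case ( @special ) {
        my( $re, $key ) = @$case;
        return $key if $rest =~ $re;
    }

    # Otherwise fall back to generic masking.
    $rest =~ s[ (?: [\w-]+ \. )+ [\w-]+ ][****]gx;   # dotted names and addresses
    $rest =~ s[ < [^>]+ > ][****]gx;                 # anything in angle brackets
    return $rest;
}

# Made-up sample messages, for illustration only.
print normalize( 'warning: Illegal address syntax from unknown[10.0.0.1] in MAIL command: <x y>' ), "\n";
print normalize( 'connect from mail.example.com' ), "\n";
```

Inside the loop, the two if blocks and the substitutions would then collapse to something like ++$log{ $src }{ $mode }{ normalize( $rest ) }. Order matters in the table: put the more specific patterns first.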

The results from the larger datasets: