Re: adaptive syslog message parsing
by BrowserUk (Patriarch) on Jun 06, 2007 at 23:30 UTC
|
This makes several assumptions based solely upon the sample data provided:
Updated: Simplified 1 regex and improved another.
#! perl -slw
use strict;
my %log;
while( <DATA> ) {
    ## Split each line into source host, daemon name (pid dropped) and message
    my( $src, $mode, $rest ) = m'
        ( ^ \S+ ) \s+ - \s+
        ( [^\[:]+ ) (?: \[ \d+ \] )? : \s*
        ( .+ $ )
    'x;
    ## Mask dotted names and addresses
    ## $rest =~ s[ (?: \S+ \. ){1,4} \S+ ][****]gx;
    $rest =~ s[ (?: [\w-]+ \. ){1,4} [\w-]+ ][****]gx;
    ## Mask leading ids of the form "lowercase letter, word chars, digit, colon"
    ## $rest =~ s[ [a-z] (?= [^:]* [A-Z] [^:\s]+ \d ) [^:\s]+
    $rest =~ s[ [a-z] \w+ \d : ][****]gx;
    ## Count each generalised message per source and daemon
    ++$log{ $src }{ $mode }{$rest};
}
for my $src ( sort keys %log ) {
    print $src;
    for my $mode ( sort keys %{ $log{ $src } } ) {
        print " $mode";
        print " ($log{ $src}{ $mode }{ $_ }) $_"
            for sort keys %{ $log{ $src}{ $mode } };
    }
}
__DATA__
your sample data
Produces (after update):
C:\test>junk5
infocache02
ldap_cachemgr
(1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
(1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
sendmail
(3) **** Losing ./**** savemail panic
(2) **** SYSERR(root): savemail: cannot save rejected email anywhere
mail2-in
postfix/smtpd
(2) warning: ****: address not listed for ****
(4) warning: ****: hostname **** verification failed: hostname nor servname provided, or not known
mail2-out
ntpd
(5) sendto(****): Bad file descriptor
postfix/smtp
(1) warning: malformed domain name in resource data of MX record for ****:
(1) warning: numeric domain name in resource data of MX record for ****: ****
(1) warning: valid_hostname: empty hostname
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|
|
this is VERY promising and is doing almost exactly what i need, although i think it could be tweaked a little more to generalize the data a little (it outputs decent results on my more complete set of data, only it has many duplicates where messages are very similar.. )
it has also no problem with the fact that i'm normally using fqdn's instead of hosts (i only removed fqdn for brevity's sake), which is a testament to how resilient the regex is..
i need to study exactly how you did what you did so far.. any further help is most welcome too
|
|
while( <> ) {
    next if /^\s*$/; ## Skip blank lines
    my( $src, $mode, $rest ) = m'
        ( ^ \S+ ) \s+ - \s+
        ( [^\[:]+ ) (?: \[ \d+ \] )? : \s*
        ( .+ $ )
    'x;
    ## Two message types are recognised outright and counted under a fixed label
    if( $rest =~ m[warning: (?=.*Illegal address syntax)] ) {
        ++$log{ $src }{ $mode }{ 'warning: Illegal address syntax from **** in MAIL command: ****' };
        next;
    }
    if( $rest =~ m[warning: (?=.*non-SMTP command)] ) {
        ++$log{ $src }{ $mode }{ 'warning: **** non-SMTP command from ****' };
        next;
    }
    $rest =~ s[ (?: [\w-]+ \. ){1,} [\w-]+][****]gx; ## Remove fqdns
    $rest =~ s[ [a-z] \w+ \d : ][****]gx; ## Server names?
    $rest =~ s[ [A-Z0-9]{11} : ][****]x; ## Queue names
    $rest =~ s[ < [^>]+ > ][****]x; ## Common form of bad name
    ++$log{ $src }{ $mode }{ $rest };
}
The results from the larger datasets:
Re: adaptive syslog message parsing
by GrandFather (Saint) on Jun 07, 2007 at 00:51 UTC
|
use strict;
use warnings;
my $digest = bless {root => {}, maxLevel => 3};
$digest->add ($_) while <DATA>;
$digest->mergeTails ();
$digest->print ();
# Peel leading fields off the line (down to maxLevel) and file the remainder as a tail
sub add {
    my ($self, $line, $level, $context) = @_;
    $level ||= 1;
    $context ||= $self->{root};
    if ($level == $self->{maxLevel} or $line !~ s/(\S*?)\s*\W\s+//) {
        push @{$context->{tails}}, $line;
        return;
    }
    my $prefix = $1;
    $context->{$prefix} ||= {};
    $context = $context->{$prefix};
    $self->add ($line, 1 + $level, $context);
}
# Merge tails of equal length, replacing the characters that differ with '*'
sub mergeTails {
    my ($self, $context) = @_;
    $context ||= $self->{root};
    unless (exists $context->{tails}) {
        $self->mergeTails ($context->{$_}) for keys %$context;
        return;
    }
    my @tails = sort {length $a <=> length $b} @{$context->{tails}};
    my @groups;
    push @{$groups[length $_]}, $_ for @tails;
    @groups = grep {defined $_} @groups;
    for my $group (@groups) {
        my $mask = pop @$group;
        my $count = 1;
        while (@$group) {
            my $str = pop @$group;
            my $mix = $mask ^ $str;         # "\0" wherever the two strings agree
            my $cpl = "\xff" x length $mix;
            $mix =~ tr/\0/\xff/c;           # differing positions become "\xff"
            $mix = $mix ^ $cpl;             # invert: agree => "\xff", differ => "\0"
            $mask = $mask & $mix;           # zero out the differing positions
            ++$count;
        }
        $mask =~ tr/\0/*/;
        push @{$context->{digest}}, [$mask, $count];
    }
}
# Walk the tree, printing each level's keys and then the digested tails with counts
sub print {
    my ($self, $context, $indent) = @_;
    $context ||= $self->{root};
    $indent ||= '';
    if (exists $context->{digest}) {
        print "$indent($_->[1]) $_->[0]" for @{$context->{digest}};
        return;
    }
    for (sort keys %$context) {
        print "$indent$_\n";
        $self->print ($context->{$_}, $indent . ' ');
    }
}
__DATA__
Prints:
infocache02
ldap_cachemgr
(1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
(1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
sendmail
(3) l*******0*****: Losing ./qfl*******0*****: savemail panic
(2) l*******0*****: SYSERR(root): savemail: cannot save rejected email anywhere
mail2-in
postfix/smtpd
(2) warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
(2) warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known
(1) warning: 201.29.80.154: hostname 20129080154.user.veloxzone.com.br verification failed: hostname nor servname provided, or not known
(1) warning: 190.55.102.166: hostname cpe-190-55-102-166.telecentro.com.ar verification failed: hostname nor servname provided, or not known
mail2-out
ntpd
(5) sendto(192.168.4.*0): Bad file descriptor
postfix/smtp
(1) warning: valid_hostname: empty hostname
(1) warning: malformed domain name in resource data of MX record for hotmil.com:
postfix/smtp[32282]
(1) warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
which doesn't quite digest the tails as you would like, but you could get a lot closer by dealing with matching runs of 'words' (/(\S+)/) rather than runs of characters. Algorithm::Diff would facilitate matching runs (left as an exercise for the reader).
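For anyone puzzling over the string arithmetic in mergeTails, here is the same masking step pulled out into a standalone sketch. Note that mask_pair is a made-up helper name, not part of the code above, and it assumes both strings have the same length (which is why mergeTails groups the tails by length first).

```perl
use strict;
use warnings;

# Hypothetical helper (not in the post above): zero out the positions
# at which two equal-length strings differ, using the same bitwise
# string operators as mergeTails.
sub mask_pair {
    my ($mask, $str) = @_;
    die "strings must be the same length\n"
        unless length($mask) == length($str);

    my $mix = $mask ^ $str;            # "\0" wherever the characters agree
    $mix =~ tr/\0/\xff/c;              # every differing position becomes "\xff"
    $mix ^= "\xff" x length($mix);     # invert: agree => "\xff", differ => "\0"
    $mask &= $mix;                     # keep agreeing characters, zero the rest
    return $mask;
}

my $mask = mask_pair('sendto(192.168.4.10)', 'sendto(192.168.4.20)');
$mask =~ tr/\0/*/;                     # make the zeroed positions visible
print "$mask\n";                       # prints sendto(192.168.4.*0)
```

Only the one differing digit is starred, which is why the ntpd line in the output above keeps most of the address: the comparison is per character, not per word.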
DWIM is Perl's answer to Gödel
|
|
Ok, I couldn't resist!
add:
use Algorithm::Diff;
toward the start. In sub add change:
push @{$context->{tails}}, $line;
to:
push @{$context->{tails}}, [$line =~ /(\S+)/g];
In sub mergeTails replace everything after:
my @groups;
with:
push @{$groups[@$_]}, $_ for @tails;
@groups = grep {defined $_} @groups;
for my $group (@groups) {
    my @ref = @{$group->[-1]};
    my @org = @ref;
    my $count = 1;
    pop @$group;
    while (@$group) {
        my @new = @{pop @$group};
        my @diffs = Algorithm::Diff::diff (\@ref, \@new);
        for my $change (@diffs) {
            next unless $change->[0][0] eq '-';
            $ref[$change->[0][1]] = undef;
        }
        ++$count;
    }
    for (0 .. $#ref) {
        next if defined $ref[$_];
        $org[$_] = '*****';
    }
    # The tails are now word lists without the trailing newline, so put one back for sub print
    push @{$context->{digest}}, [join (' ', @org) . "\n", $count];
}
} # closes sub mergeTails
Now prints:
infocache02
ldap_cachemgr
(1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
(1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
sendmail
(3) ***** Losing ***** savemail panic
(2) ***** SYSERR(root): savemail: cannot save rejected email anywhere
mail2-in
postfix/smtpd
(2) warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
(4) warning: ***** hostname ***** verification failed: hostname nor servname provided, or not known
mail2-out
ntpd
(5) ***** Bad file descriptor
postfix/smtp
(1) warning: valid_hostname: empty hostname
(1) warning: malformed domain name in resource data of MX record for hotmil.com:
postfix/smtp[32282]
(1) warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
|
|
i admit, i lol'd when i read 'i couldn't resist..'
i couldn't duplicate the output with the sample data using algorithm diff, it was similar but the new lines were off..
additionally, i have a more complete set of data that it doesn't output anything but one line (with the number 6 in parenthesis).. it looks pretty promising on the short set of sample data but i think it's confused with the big set of data (which happens to use fqdn instead of just hostname)
|
|
Re: adaptive syslog message parsing
by AK108 (Friar) on Jun 06, 2007 at 23:13 UTC
|
You'll probably want to use a regular expression to parse the log entry.
The line my($system, $subsystem, $message) = /(\S+)\s*\-\s*([^:]+):\s*(.+)/; takes advantage of $_. It then matches any non-space group ((\S+), which becomes $system), any number of spaces (\s*), a dash (-), any number of spaces (\s*), anything until the colon (([^:]+), which becomes $subsystem), the colon (:), any number of spaces (\s*), and finally the rest of the string ((.+), which becomes $message).
This results in the following:
|
|
i probably do want to use a regular expression.. :) unfortunately, that wasn't the part of the project i was having difficulty with..
where i'm stumped is trying to aggregate all messages of the same type (which requires the code to figure out which parts of the line are varying).. that part you haven't touched upon in your reply (but you did save me the work of writing the hash-building loop, albeit the easy part).. thanks so far
|
|
Yes, I used an HoHoH. The first dimension is the system, the second is the program (what I called the subsystem), and the third dimension is the message. The value of the third dimension gets the counts.
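For readers following along, a HoHoH of that shape might be built roughly like this. This is only a sketch, not the original code: parse_entry is a made-up helper wrapping the regex quoted earlier in the thread, and the sample lines are invented in the assumed "host - daemon[pid]: text" layout.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the one quoted earlier in the thread.
sub parse_entry {
    my ($line) = @_;
    my ($system, $subsystem, $message) =
        $line =~ /(\S+)\s*\-\s*([^:]+):\s*(.+)/;
    return unless defined $message;      # line did not match
    return ($system, $subsystem, $message);
}

my %count;    # $count{system}{subsystem}{message} = number of occurrences

# Invented sample lines; real input would come from the log file.
my @sample = (
    'mail2-out - ntpd[123]: sendto(192.168.4.10): Bad file descriptor',
    'mail2-out - ntpd[123]: sendto(192.168.4.10): Bad file descriptor',
    'not a log line',
);

for my $line (@sample) {
    my ($system, $subsystem, $message) = parse_entry($line) or next;
    $count{$system}{$subsystem}{$message}++;
}

for my $system (sort keys %count) {
    for my $subsystem (sort keys %{ $count{$system} }) {
        for my $message (sort keys %{ $count{$system}{$subsystem} }) {
            my $n = $count{$system}{$subsystem}{$message};
            print "$system / $subsystem ($n) $message\n";
        }
    }
}
```

With this pattern the pid stays in the subsystem key (ntpd[123]), and two messages are only counted together when their text is identical, which is the limitation the original question is about.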
Were you wanting something different than that? An array containing hashrefs might be an option, as it would preserve the initial order. Alternately, you could use a hash for each message string, and then have each value of that be an array, with each item representing a given instance of that message. There's lots of ways to implement this, data structure-wise, and it's usually easy to transform between them.
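As a rough illustration of how easy the transformation is, the sketch below starts from an array of hashrefs (which keeps arrival order) and derives both of the other shapes from it. The field names and the sample entries are invented for the example.

```perl
use strict;
use warnings;

# Parsed entries in arrival order; field names are illustrative only.
my @entries = (
    { system => 'mail2-in',  program => 'postfix/smtpd', message => 'warning: A' },
    { system => 'mail2-out', program => 'ntpd',          message => 'Bad file descriptor' },
    { system => 'mail2-in',  program => 'postfix/smtpd', message => 'warning: A' },
);

# Array of hashrefs -> hash of arrays: each message string maps to
# the list of entries (instances) that carried it.
my %instances;
push @{ $instances{ $_->{message} } }, $_ for @entries;

# Array of hashrefs -> HoHoH of counts.
my %count;
$count{ $_->{system} }{ $_->{program} }{ $_->{message} }++ for @entries;

printf "%d x %s\n", scalar @{ $instances{$_} }, $_ for sort keys %instances;
# prints:
# 1 x Bad file descriptor
# 2 x warning: A
```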
Also, you might be interested in trying to parse out any dates, times, computer names, IP addresses, or anything else relevant to build a context for each message. You can also throw a while(1) {...} loop around it to continually read from the file (once you add the code to open/read/close, that is).
Re: adaptive syslog message parsing
by thezip (Vicar) on Jun 07, 2007 at 00:15 UTC
|
Given that you could have many types of logged messages, with each having its own format specification, you might apply a regex to determine its type. Once an entry is classified, send it to an appropriate handler (subroutine) that knows how to parse that type of entry into its component parts, and then stuff the guts you care about into an appropriate data structure.
The structure might look like:
a hash of servers
a hash of daemon names
a hash of messages (and their cumulative frequencies)
For entries that do not classify to a handler you have coded, send these to an exception report (log file), and create the necessary handlers later as needed.
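A minimal sketch of that classify-and-dispatch idea, assuming the line has already been split into server, daemon and message. The two patterns, the handler bodies and the classify name are all illustrative guesses, not something tested against real logs.

```perl
use strict;
use warnings;

my %log;             # $log{server}{daemon}{message} = cumulative frequency
my @unclassified;    # entries no handler recognised (the exception report)

# Each handler pairs a classifying regex with a sub that returns the
# generalised message to count. Both patterns are assumptions.
my @handlers = (
    [ qr/^warning: \S+ hostname \S+ verification failed/,
      sub { 'warning: ****: hostname **** verification failed' } ],
    [ qr/^sendto\(/,
      sub { my ($msg) = @_; $msg =~ s/\([^)]*\)/(****)/; $msg } ],
);

sub classify {
    my ($server, $daemon, $message) = @_;
    for my $handler (@handlers) {
        my ($pattern, $code) = @$handler;
        next unless $message =~ $pattern;
        $log{$server}{$daemon}{ $code->($message) }++;
        return 1;
    }
    push @unclassified, "$server - $daemon: $message";
    return 0;
}

classify('mail2-out',   'ntpd',          'sendto(192.168.4.10): Bad file descriptor');
classify('mail2-out',   'ntpd',          'sendto(192.168.4.20): Bad file descriptor');
classify('infocache02', 'ldap_cachemgr', 'libsldap: Status: 91');

print "$_\n" for @unclassified;    # prints infocache02 - ldap_cachemgr: libsldap: Status: 91
```

New message types then only need another entry in @handlers; anything that falls through lands in @unclassified for later review.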
Where do you want *them* to go today?
|
|
hash{servers}
hash{daemon names}
array[$time, message]
or
$struct->{$server}->{$daemon}->[$message_id] = [$time,$message];
Though for either method it would be nice to send a GD generated graph for each set server/daemon, which can give quick access to needed information.
|
|
Yes, these are nice enhancements, but they are not to the OP's specification, which detailed that the statistical frequencies (i.e. counts) be stored in the lowest hashes.
I cannot comment as to which is better since I don't have the bigger picture of the problem.
Re: adaptive syslog message parsing
by otto (Beadle) on Jun 07, 2007 at 07:28 UTC
|
Maybe this doesn't exactly answer the question, but it should maybe be asked... Personally I love writing code, but I'm much more efficient if I can beg, borrow, or well I guess I won't steal but you get the drift...
First, since this is "from a variety of sources/daemons, in a variety of formats", I'd consider configuring the syslog to dump different sources of logs to different files. Then you limit the parsing issues to a particular topic... you should not have to deal with parsing mail log messages and dhcp messages in the same file.
Second, you note the volume and "who wants to read every line". I would suggest you consider something like rrdtool. It has many parsers for various kinds of log files and it makes pretty graphs. Even if you choose not to use rrdtool, you can grab the parsers, many of which are perl, and look at them for parsing each of the various formats in which you are interested... (Caveat... I have not used rrdtool yet, but am planning on it.)
..Otto
|
|
thanks for the ideas otto.. unfortunately i have neither the access nor the authority to start modifying the syslog server itself since it's in production.. i'll have the server's architect take your suggestion into consideration, but it may cause regressive problems with other systems that rely on the way it's structured now (albeit i'll agree it isn't very well designed)...
as far as rrdtool, i've used it a bit.. it definitely makes nice graphs, but it has more of a numerical-analysis application rather than dealing with text parsing.. it's, at its core, a backend db for storing data in a fixed amount of space, along with extensions to create graphs from the db you populated with (usually) numerical measurements, counters, and the like..
Re: adaptive syslog message parsing
by ryanc (Monk) on Jun 07, 2007 at 18:11 UTC
|
what about tenshi?
it is written in Perl and does almost exactly what you want, aside from the HTML page. the output it sends you via email is very similar to your desired HTML output.
|
|
Thanks for the tip, not the poster, but installed it and after a few tweaks it's running fine. Useful tool.
Thanks.
Re: adaptive syslog message parsing
by tracton (Initiate) on Jun 08, 2007 at 21:00 UTC
|
The problem with fuzzy matching is that you will not be able to assign meaning without considerable manual labor after the analysis phase, and the analysis will probably break lines into pieces a human would not consider reasonable.
Instead, protect your head and consider the perl package "logwatch", which summarizes log files via email. It comes with ~60 service-specific filters to parse many unix log files. Unrecognized lines are simply passed through un-summarized, and then you'll know what filters you need to write/update.
It will at least get you started and may help you choose a solution path.
Re: adaptive syslog message parsing
by telcontar (Beadle) on Jun 08, 2007 at 06:37 UTC
|
Have you considered using syslog-ng instead of syslog? You can do content filtering in quite flexible ways at a higher level, and you can create reusable configuration .. That way you can split which messages you really *want* to read into several files and throw away the rest. I'd imagine this approach would be easier to maintain than doing it with regexps.
|
|
we actually do use syslog-ng on most hosts, all those entries were the result of syslog-ng logging..
the main goal is having a centralized place to look at it for all messages, informative and otherwise, and a way to cut down on the false positives.. by performing this "smart-regex" with summarization/generalization, hopefully it will give a singular viewpoint for how to respond to the ones that are found to actually need response (how severe an error is however can't be calculated by any program, and needs sysadmin intervention).. the last part is what my script aims to solve