in reply to adaptive syslog message parsing

You'll probably want to use a regular expression to parse the log entry.

#!/usr/bin/perl
# Script to parse logfile as specified in PerlMonks node 619685

use strict;
use warnings;

my %log;

while(<DATA>) {
    # This is the interesting line, I'll explain it below
    my($system, $subsystem, $message) = /(\S+)\s*\-\s*([^:]+):\s*(.+)/;
    $log{$system}{$subsystem}{$message}++;
}

foreach my $system (sort keys %log) {
    print "$system:\n";
    foreach my $subsystem (sort keys %{$log{$system}}) {
        print "\t$subsystem\n";
        foreach my $message (sort keys %{$log{$system}{$subsystem}}) {
            print "\t\t($log{$system}{$subsystem}{$message}) $message\n";
        }
    }
    print "\n";
}

__DATA__
mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
mail2-out - postfix/smtp: warning: valid_hostname: empty hostname
mail2-out - postfix/smtp: warning: malformed domain name in resource data of MX record for hotmil.com:
mail2-out - ntpd: sendto(192.168.4.20): Bad file descriptor
mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
mail2-out - postfix/smtp[32282]: warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
mail2-out - ntpd: sendto(192.168.4.20): Bad file descriptor
mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor
infocache02 - ldap_cachemgr: libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
infocache02 - ldap_cachemgr: Error: Unable to refresh from profile:tls_automount_profile. (error=1)
infocache02 - sendmail: l560aB7V017120: Losing ./qfl560aB7V017120: savemail panic
infocache02 - sendmail: l560aB7V017120: SYSERR(root): savemail: cannot save rejected email anywhere
infocache02 - sendmail: l1FM2rFa026352: Losing ./qfl1FM2rFa026352: savemail panic
infocache02 - sendmail: l1FM2rFa026352: SYSERR(root): savemail: cannot save rejected email anywhere
infocache02 - sendmail: l1FI2rFa022597: Losing ./qfl1FI2rFa022597: savemail panic
mail2-in - postfix/smtpd: warning: 190.55.102.166: hostname cpe-190-55-102-166.telecentro.com.ar verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 201.29.80.154: hostname 20129080154.user.veloxzone.com.br verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
mail2-in - postfix/smtpd: warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com
mail2-in - postfix/smtpd: warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known
mail2-in - postfix/smtpd: warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known

The line my($system, $subsystem, $message) = /(\S+)\s*\-\s*([^:]+):\s*(.+)/; matches against $_ implicitly and, because the match is in list context, returns the three captured groups. It matches a run of non-whitespace characters ((\S+), which becomes $system), optional whitespace (\s*), a dash (\-), optional whitespace (\s*), everything up to the first colon (([^:]+), which becomes $subsystem), the colon (:), optional whitespace (\s*), and finally the rest of the line ((.+), which becomes $message). Running the script produces the following:

infocache02:
	ldap_cachemgr
		(1) Error: Unable to refresh from profile:tls_automount_profile. (error=1)
		(1) libsldap: Status: 91 Mesg: openConnection: simple bind failed - Can't connect to the LDAP server
	sendmail
		(1) l1FI2rFa022597: Losing ./qfl1FI2rFa022597: savemail panic
		(1) l1FM2rFa026352: Losing ./qfl1FM2rFa026352: savemail panic
		(1) l1FM2rFa026352: SYSERR(root): savemail: cannot save rejected email anywhere
		(1) l560aB7V017120: Losing ./qfl560aB7V017120: savemail panic
		(1) l560aB7V017120: SYSERR(root): savemail: cannot save rejected email anywhere

mail2-in:
	postfix/smtpd
		(1) warning: 190.55.102.166: hostname cpe-190-55-102-166.telecentro.com.ar verification failed: hostname nor servname provided, or not known
		(2) warning: 190.8.87.73: hostname din-190-8-87-73.manquehue.net verification failed: hostname nor servname provided, or not known
		(1) warning: 201.29.80.154: hostname 20129080154.user.veloxzone.com.br verification failed: hostname nor servname provided, or not known
		(2) warning: 84.9.96.201: address not listed for hostname mail.intechcentre.com

mail2-out:
	ntpd
		(3) sendto(192.168.4.10): Bad file descriptor
		(2) sendto(192.168.4.20): Bad file descriptor
	postfix/smtp
		(1) warning: malformed domain name in resource data of MX record for hotmil.com:
		(1) warning: valid_hostname: empty hostname
	postfix/smtp[32282]
		(1) warning: numeric domain name in resource data of MX record for uyahoo.com: 10.0.0.2
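
For readability, the same pattern could also be written with the /x modifier so each piece carries its own comment. This is only a sketch: parse_line is a hypothetical helper name that does not appear in the script above, and the pattern is meant to be equivalent to the one-line version, not an improvement on it.

```perl
use strict;
use warnings;

# parse_line is a hypothetical helper, not part of the original script.
# It returns (system, subsystem, message), or an empty list on no match.
sub parse_line {
    my ($line) = @_;
    return $line =~ m{
        (\S+)       # system: leading run of non-whitespace
        \s* - \s*   # dash separator with optional whitespace around it
        ([^:]+)     # subsystem: everything up to the first colon
        : \s*       # the colon and optional whitespace
        (.+)        # message: the rest of the line
    }x;
}

my ($system, $subsystem, $message) =
    parse_line('mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor');
print "$system | $subsystem | $message\n";
```

Note that variables are still interpolated inside /x comments, which is why the comments name the fields without a leading sigil.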

Re^2: adaptive syslog message parsing
by Anonymous Monk on Jun 06, 2007 at 23:33 UTC
    i probably do want to use a regular expression.. :) unfortunately, that wasn't the part of the project i was having difficulty with..

    where i'm stumped is trying to aggregate all messages of the same type (which requires the code to figure out which parts of the line are varying).. that part you haven't touched upon in your reply (but you did save me the work of writing the hash-building loop, albeit the easy part).. thanks so far

      Yes, I used an HoHoH (a hash of hashes of hashes). The first level is keyed by system, the second by program (what I called the subsystem), and the third by message. The value stored at the third level is the count.

      Were you wanting something different from that? An array containing hashrefs might be an option, as it would preserve the initial order. Alternatively, you could use a hash keyed on each message string, with each value being an array whose items each represent one instance of that message. There are lots of ways to implement this, data structure-wise, and it's usually easy to transform between them.
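
As a rough sketch of the second alternative (a hash keyed on the message string, with an array of instances as the value), something like the following might work. The helper name group_by_message and the fields stored per instance (system, subsystem, line) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: group parsed entries by message text, keeping
# every occurrence (with its input position) in an array.
sub group_by_message {
    my @lines = @_;
    my %by_message;
    my $position = 0;
    for my $line (@lines) {
        $position++;
        # Skip lines that do not match the "system - subsystem: message" shape.
        my ($system, $subsystem, $message) =
            $line =~ /(\S+)\s*-\s*([^:]+):\s*(.+)/ or next;
        push @{ $by_message{$message} },
            { system => $system, subsystem => $subsystem, line => $position };
    }
    return \%by_message;
}

my $grouped = group_by_message(
    'mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor',
    'mail2-out - ntpd: sendto(192.168.4.20): Bad file descriptor',
    'mail2-out - ntpd: sendto(192.168.4.10): Bad file descriptor',
);

for my $message (sort keys %$grouped) {
    my @positions = map { $_->{line} } @{ $grouped->{$message} };
    print scalar(@positions), "x $message (input lines: @positions)\n";
}
```

The count is then just the length of each array, and the original order can be recovered from the stored line positions.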

      Also, you might be interested in trying to parse out any dates, times, computer names, IP addresses, or anything else relevant to build a context for each message. You can also throw a while(1) {...} loop around it to continually read from the file (once you add the code to open/read/close, that is).
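
To illustrate parsing out one kind of context, here is a minimal sketch that pulls IP addresses out of a message and replaces them with a placeholder, so messages that differ only by address collapse into one template. The helper name extract_ips, the <IP> placeholder, and the loose IPv4 pattern are all assumptions; this handles only one kind of varying field, not the general problem of detecting which parts of a line vary.

```perl
use strict;
use warnings;

# Loose IPv4 pattern (assumption: good enough for these logs; it does
# not check that each octet is <= 255).
my $ip_re = qr/\b\d{1,3}(?:\.\d{1,3}){3}\b/;

# Hypothetical helper: returns the message with each address replaced by
# the placeholder <IP>, plus an array ref of the addresses it found.
sub extract_ips {
    my ($message) = @_;
    my @ips = $message =~ /($ip_re)/g;
    (my $template = $message) =~ s/$ip_re/<IP>/g;
    return ($template, \@ips);
}

my %seen;    # template => { count => N, ips => [ ... ] }
for my $message (
    'sendto(192.168.4.10): Bad file descriptor',
    'sendto(192.168.4.20): Bad file descriptor',
    'sendto(192.168.4.10): Bad file descriptor',
) {
    my ($template, $ips) = extract_ips($message);
    $seen{$template}{count}++;
    push @{ $seen{$template}{ips} }, @$ips;
}

for my $template (sort keys %seen) {
    my $count = $seen{$template}{count};
    my $ips   = join ', ', @{ $seen{$template}{ips} };
    print "($count) $template <- $ips\n";
}
```

The same approach could be extended with further patterns (queue IDs, bracketed PIDs, hostnames), each with its own placeholder, before using the template as the hash key.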