stevbutt has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, Wisdom is what I seek

I have been trying to process several different log formats with some success, but the mail ones have me a little stuck.

Here is what my source data looks like:

May 2 07:06:20 lon.mail.net exim[1234]: 2012-05-02 07:06:20 1PSPtU-0004en-1e <= it_ndt_bounces@new.itunes.com H=smtpmail.com [21.5.10.4] I=[8.4.14.4]:25 P=esmtp S=1966 id=1603882764.112965659.1335927964793.Mail.cboxp@ednabay.apple.com T="New on iTunes: One Thing And, Then Another, Cooking Apps,\n Great Deals on First Seasons, and M"
May 2 07:06:20 lon.mail.net exim[1234]: 2012-05-02 07:06:20 1PSPtU-0004en-1e <= it_ndt_bounces@new.itunes.com H=smtpmail.com [21.5.10.4] I=[8.4.14.4]:25 P=esmtp S=1966 id=1603882764.112965659.1335927964793.Mail.cboxp@ednabay.apple.com T="New on iTunes: One Thing And, Then Another, Cooking Apps,\n Great Deals on First Seasons, and M"
May 2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 1PSPtU-0004en-1e => peterpiper <peterpiper@nosuchdomain.net> R=local_mail T=local_maildir_mail_drop

I have code now which processes basic syslog type entries into a number of fields

#!/usr/bin/perl
use strict;
use warnings;
no warnings q{uninitialized};

while (my $line = <STDIN>) {
    chomp($line);
    my ( $mon, $day, $time, $loghost, $prog, $remainder ) =
        split m{:?\s+}, $line, 6;
    my %monthNos = do {
        my $no = 0;
        map { $_ => ++ $no }
            qw{ Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec };
    };
    my ( $user ) = $remainder =~ m{user=([^,]+)};
    my ( $rip )  = $remainder =~ m{rip=([^,]+)};
    $remainder =~ tr/"/'/;
    my $yr = q{2012};
    my $csv = sprintf q{%02d/%02d/%s %s,%s,%s,"%s",%s,%s},
        $day, $monthNos{ $mon }, $yr, $time, $loghost, $prog,
        $remainder, $user, $rip;
    print "$csv\n";
}

My problem now is that in exim logs the fields seem to mean different things depending on whether the line contains <=, =>, == or even **.

Because my files potentially contain millions of lines, I am looking for an efficient way of effectively saying:

if contains <= then ......
else if contains => then .....
else if contains == then ....
else if contains ** then ....
else somethingelse etc

Any tips on making what I am doing now faster would also be welcome.

Many thanks in advance,

Steve

Re: Process mail logs
by GrandFather (Saint) on Aug 12, 2012 at 23:35 UTC

    An interesting alternative if you need non-trivial processing for each operation is to use a dispatch table. Consider:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.010;

    my %ops = ('==' => \&doEq, '=>' => \&doGe, '<=' => \&doLe, '**' => \&doStar,);
    my $opMatch = join '|', map {qr{\Q$_\E}} keys %ops;

    while (my $line = <DATA>) {
        chomp($line);
        my ($op) = $line =~ m{\s($opMatch)\s};

        if ($op) {
            $ops{$op}->($line);
        } else {
            print "No op found: $line\n";
        }
    }

    sub doEq {my ($line) = @_; print "Do da ==: $line\n";}
    sub doLe {my ($line) = @_; print "Doing <=: $line\n";}
    sub doGe {my ($line) = @_; print "Doing =>: $line\n";}
    sub doStar {my ($line) = @_; print "Process **: $line\n";}

    __DATA__
    1234 <= inb@it.com
    1234 <= inb@it.com
    1235 => pp <pp@nsd.net>
    1235 ++ pp <pp@nsd.net>

    Prints:

    Doing <=: 1234 <= inb@it.com
    Doing <=: 1234 <= inb@it.com
    Doing =>: 1235 => pp <pp@nsd.net>
    No op found: 1235 ++ pp <pp@nsd.net>
    True laziness is hard work

      Ok I think I have pieced together a solution but would appreciate one last ( hopefully ) piece of wisdom.

      The host section of the log can look like any of the following examples:

      blah blah H=(10.21.32.43) [192.168.8.34] blah
      blah blah H=([10.21.32.43]) [192.168.8.34] blah
      blah blah H=mailsrvr.mail.com [192.168.8.34] blah

      The data I want to assign is the number in the square brackets, i.e. 192.168.8.34.

      How would I go about pattern matching to extract this, i.e. the IP address in square brackets after the space following the initial text following H=?

      Many Thanks, Steve

        If you plan on using Perl for more than a day or two I strongly recommend you read through the regular expression documentation provided with Perl (see perlretut, perlre and perlreref). Perl is strong on text processing, and a large chunk of that comes from regular expressions, so understanding Perl's regular expressions is important to writing good Perl code.

        For this particular match you could make it more or less fussy (like matching the () part or not). A somewhat non-fussy match would be /H=.*? \s \[ ([^\]]+) \]/x. Requiring white space before the opening bracket matters for the H=([10.21.32.43]) form: a pattern that stops at the first [ after H=, such as /H=[^[]* \[ ([^\]]+) \]/x, would capture the address inside the parentheses instead. Note the use of the x flag to allow white space in the expression so it's easier to see the various moving parts.

        True laziness is hard work
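To make the host-field match above concrete, here is a minimal sketch run against the three sample forms from the question. The helper name host_ip is hypothetical and not from the thread. The pattern assumes that the address wanted is the first bracketed item preceded by white space after H=, which holds for the three samples but has not been checked against every exim log variant.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the address inside the
# square brackets that follow the H= host field, or undef if there is none.
sub host_ip {
    my ($line) = @_;

    # H=, then as little text as possible, then white space, then [address].
    # Insisting on white space before '[' skips the H=([10.21.32.43]) form,
    # where the first '[' belongs to the parenthesised name.
    my ($ip) = $line =~ m{ \bH= .*? \s \[ ( [^\]]+ ) \] }x;
    return $ip;
}

for my $line (
    'blah blah H=(10.21.32.43) [192.168.8.34] blah',
    'blah blah H=([10.21.32.43]) [192.168.8.34] blah',
    'blah blah H=mailsrvr.mail.com [192.168.8.34] blah',
) {
    my $ip = host_ip($line);
    print defined $ip ? $ip : 'no match', "\n";
}
```

All three sample lines print 192.168.8.34. Nothing here needs a Perl newer than 5.8.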
Re: Process mail logs
by GrandFather (Saint) on Aug 12, 2012 at 23:24 UTC

    If you are using Perl 5.10 or newer then you can use the new given/when syntax to select between multiple code paths. However, anything of that nature is likely to be syntactic sugar (which may provide a programmer and maintainer efficiency gain) rather than a runtime improvement. No tinkering with code at that level is going to make any interesting difference to runtime performance, because any trivial change in speed there will be completely hidden in I/O time.

    A very minor speed improvement can be made by moving %monthNos population out of the loop. Turning off warnings for the entire scope is a bad idea. Either do it locally, or better still fix the error. With those changes and using given/when your code looks like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.010;

    my $no = 0;
    my %monthNos = map {$_ => ++$no}
        qw{ Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec };

    while (my $line = <DATA>) {
        chomp($line);
        my ($mon, $day, $time, $loghost, $prog, $remainder) =
            split m{:?\s+}, $line, 6;
        my ($user) = $remainder =~ m{user=([^,]+)};
        my ($rip)  = $remainder =~ m{rip=([^,]+)};
        my ($op)   = $remainder =~ m{\s(==|<=|\*\*|\+\+)\s};
        my $yr     = 2012;

        $remainder =~ tr/"/'/;
        $_ //= '' for $user, $rip;

        given ($op) {
            when ('==')  {print "Do da ==\n";}
            when ('<=')  {print "Doing <=\n";}
            when ('**')  {print "Process **\n";}
            when (undef) {print "No op found\n";}
            default      {print "Duh! Can't do $op\n";}
        }

        my $csv = sprintf q{%02d/%02d/%s %s,%s,%s,"%s",%s,%s},
            $day, $monthNos{$mon}, $yr, $time, $loghost, $prog, $remainder, $user, $rip;
        print "$csv\n";
    }

    __DATA__
    May 2 07:06:20 l.net exim[1234]: 2012-05-02 07:06:20 1e <= inb@it.com
    May 2 07:06:20 l.net exim[1234]: 2012-05-02 07:06:20 1e <= inb@it.com
    May 2 07:06:20 l.net exim[1235]: 2012-05-02 07:06:20 1e => pp <pp@nsd.net>
    May 2 07:06:20 l.net exim[1235]: 2012-05-02 07:06:20 1e ++ pp <pp@nsd.net>

    Prints:

    Doing <=
    02/05/2012 07:06:20,l.net,exim[1234],"2012-05-02 07:06:20 1e <= inb@it.com ",,
    Doing <=
    02/05/2012 07:06:20,l.net,exim[1234],"2012-05-02 07:06:20 1e <= inb@it.com ",,
    No op found
    02/05/2012 07:06:20,l.net,exim[1235],"2012-05-02 07:06:20 1e => pp <pp@nsd.net> ",,
    Duh! Can't do ++
    02/05/2012 07:06:20,l.net,exim[1235],"2012-05-02 07:06:20 1e ++ pp <pp@nsd.net> ",,
    True laziness is hard work

      Unfortunately this has to run on a stock Solaris 10 machine and it only comes with Perl 5.8.4, which I think means the "given" method will not work.

      Will the dispatch table method work with this older version of Perl?
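A dispatch table is just a hash of code references, so nothing about the technique itself requires Perl 5.10. The sketch below is one way it might be written using only constructs available in 5.8 (no "use 5.010", no say, no //). The handler bodies and the dispatch helper are placeholders made up for illustration; the labels reflect the usual reading of the exim flags (<= arrival, => delivery, == deferred, ** failed) and should be checked against the exim documentation for your version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder handlers: each returns a label instead of doing real work.
my %ops = (
    '<=' => sub { return "arrival: $_[0]" },
    '=>' => sub { return "delivery: $_[0]" },
    '==' => sub { return "deferred: $_[0]" },
    '**' => sub { return "failed: $_[0]" },
);

# Build the alternation once, outside the loop. quotemeta escapes the
# characters (such as '*') that are special inside a regex.
my $op_match = join '|', map { quotemeta($_) } keys %ops;
my $op_re    = qr{\s($op_match)\s};

# Hypothetical helper: find the flag and call the matching handler.
sub dispatch {
    my ($line) = @_;
    my ($op) = $line =~ $op_re;
    return defined $op ? $ops{$op}->($line) : "no op found: $line";
}

print dispatch($_), "\n" for (
    '1234 <= inb@it.com',
    '1235 => pp <pp@nsd.net>',
    '1235 ++ pp <pp@nsd.net>',
);
```

Supporting another flag means adding one entry to %ops; the regex picks it up automatically because it is built from the hash keys.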

Re: Process mail logs
by Kenosis (Priest) on Aug 12, 2012 at 23:17 UTC

    Perhaps Perl's (5.10+) given/when will work for you:

    use Modern::Perl;

    while ( my $string = <DATA> ) {
        given ($string) {
            when (/<=/)   { say '<= was found.'; }
            when (/=>/)   { say '=> was found.'; }
            when (/==/)   { say '== was found.'; }
            when (/\*\*/) { say '** was found.'; }
            default       { say "The following was found: $_"; }
        }
    }

    __DATA__
    May 2 07:06:20 lon.mail.net exim[1234]: 2012-05-02 07:06:20 1PSPtU-0004en-1e <= it_ndt_bounces@new.itunes.com H=smtpmail.com [21.5.10.4] I=[8.4.14.4]:25 P=esmtp S=1966 id=1603882764.112965659.1335927964793.Mail.cboxp@ednabay.apple.com T="New on iTunes: One Thing And, Then Another, Cooking Apps,\n Great Deals on First Seasons, and M"
    May 2 07:06:20 lon.mail.net exim[1234]: 2012-05-02 07:06:20 1PSPtU-0004en-1e <= it_ndt_bounces@new.itunes.com H=smtpmail.com [21.5.10.4] I=[8.4.14.4]:25 P=esmtp S=1966 id=1603882764.112965659.1335927964793.Mail.cboxp@ednabay.apple.com T="New on iTunes: One Thing And, Then Another, Cooking Apps,\n Great Deals on First Seasons, and M"
    May 2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 1PSPtU-0004en-1e => peterpiper <peterpiper@nosuchdomain.net> R=local_mail T=local_maildir_mail_drop
    May 2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 1PSPtU-0004en-1e == peterpiper <peterpiper@nosuchdomain.net> R=local_mail T=local_maildir_mail_drop
    May 2 07:06:20 lon.mail.net exim[1234]: 2012-05-02 07:06:20 1PSPtU-0004en-1e ** it_ndt_bounces@new.itunes.com H=smtpmail.com [21.5.10.4] I=[8.4.14.4]:25 P=esmtp S=1966 id=1603882764.112965659.1335927964793.Mail.cboxp@ednabay.apple.com T="New on iTunes: One Thing And, Then Another, Cooking Apps,\n Great Deals on First Seasons, and M"
    May 2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 1PSPtU-0004en-1e -- peterpiper <peterpiper@nosuchdomain.net> R=local_mail T=local_maildir_mail_drop

    Output:

    <= was found.
    <= was found.
    => was found.
    == was found.
    ** was found.
    The following was found: May 2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 1PSPtU-0004en-1e -- peterpiper <peterpiper@nosuchdomain.net> R=local_mail T=local_maildir_mail_drop

    Hope this helps!

    Update: My thanks to influx for pointing me to an article in which the author encourages programmers to Use for() instead of given(). This may be especially relevant in this case, where millions of lines could be processed.

      Actually, you should try to refrain from using given/when. brian d foy explains why here.

      A pretty nice alternative I use is:

      for ($string) {               # given
          if    (/<=/) { .. }       # when
          elsif (/=>/) { .. }       # when
          else         { .. }       # default
      }

        Good call to address this given/when issue. Have updated my original post to reflect this...

      Unfortunately this has to run on a stock Solaris 10 machine and it only comes with Perl 5.8.4.

      Is there another way?

        Certainly. The given/when (or for/when) construct is cleaner, but there's no reason you can't fall back on a series of if/elsif/else. (Though I'd recommend looking at the dispatch table concept first.) In your sample data, the fields appear consistent up to the 'directional' field you're looking at, so you could:

        while(<DATA>){
            chomp;
            my( $month, $day, $time, $host, $ppid, $date, $time2, $something, $direction, $remainder ) = split ' ', $_, 10;
            if( $direction eq '<=' ){
                # go left
            } elsif( $direction eq '=>' ){
                # go right
            } elsif( $direction eq '==' ){
                # do a third thing
            } elsif( $direction eq '**' ){
                # do a fourth thing
            } else {
                # fall back on a default behavior
            }
        }

        Aaron B.
        Available for small or large Perl jobs; see my home node.
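To see what the ten-way split above yields on one of the sample lines, here is a small sketch that stores the pieces in a hash. The helper parse_line and the field names are hypothetical; in particular, calling the eighth field msgid is an assumption (the thread only calls it $something). Splitting on ' ' treats any run of white space as one separator, so the double space that syslog puts before a single-digit day does not shift the fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one exim-via-syslog line into named fields.
# The limit of 10 keeps everything after the flag together in 'remainder'.
sub parse_line {
    my ($line) = @_;
    my %f;
    @f{qw(month day time host prog date time2 msgid direction remainder)}
        = split ' ', $line, 10;
    return \%f;
}

my $f = parse_line(
    'May  2 07:06:20 lon.mail.net exim[1235]: 2012-05-02 07:06:20 '
  . '1PSPtU-0004en-1e => peterpiper <peterpiper@nosuchdomain.net> R=local_mail'
);
print "$f->{direction} | $f->{remainder}\n";
```

This prints "=> | peterpiper <peterpiper@nosuchdomain.net> R=local_mail". Lines that do not follow this layout put something else in the direction slot, which is what the final else branch (or a dispatch-table fallback) is for.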