Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am working on a script which parses a set of log files. The goal is to extract the values of certain tags in the log files with capture groups. The data I am parsing looks similar to this:

2009/01/15 01:23:45:678: ASDF: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A48=ZVZZT.N^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]
Note that ^A represents the SOH character (ASCII value 1).

My goal is to be able to capture the value of any given tag. So far, I have tried this:

if($line =~ m/^A55=(.*?^A)/){ print "$1|"; } else { print "|"; }
My output is a pipe-delimited set of values. The above PERL works with the exception that each value in my output contains the "^A". I want to correct this by capturing just the value between "\d\d\d\d=" and the next "^A" (ungreedy), where every "d" is known. If no match is found (i.e. the tag is not present), I want to output just a "blank" pipe.

The second issue I would like to resolve is to simplify or generalize my statements. Currently I have a series of if statements, such as the one above, checking for every tag I'm trying to capture (e.g. 55=, 48=, 22=). I'd like to see if there is a more "clever" way to do this in a single statement. Something such as this, perhaps:

if($line =~ m/(^A22=.*?^A).*(^A40=.*?^A).*(^A48=.*?^A).*(^A54=.*?^A).*(^A55=.*?^A)/g){ print "$1|$2|$3|$4|$5\n"; }
Please note that if one of the patterns above does not match, I'd like the corresponding $buffer variable to contain a blank rather than the next matched group value.

Thanks very much for your time and consideration,

_j

Re: Log Parsing using Regex
by moritz (Cardinal) on Jul 09, 2009 at 12:55 UTC
    The above PERL works with the exception that each value in my output contains the "^A".

    The language is called "Perl", the interpreter is called "perl". PERL does not exist.

    Anyway, if you don't want to capture the ^A, don't put it inside the parentheses:

    if ($line =~ m/\x01 55= (.*?) \x01/x) { ... }

    You can get rid of the non-greedy quantifier by using a char class instead:

    if ($line =~ m/\x01 55= ([^\x01]*) /x) { ... }
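
As a minimal sketch, the character-class pattern above could be wrapped in a small helper so that one statement covers several tags and a missing tag yields a blank field. The helper name tag_value, the shortened sample line, and the extra \x5b alternative (so the first tag after the '[' can match too) are assumptions for illustration only; they are not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of tag $tag in $line,
# or an empty string when the tag is absent.
sub tag_value {
    my ( $line, $tag ) = @_;

    # A tag is preceded by SOH, or by '[' when it is the first tag.
    return $line =~ m/ (?: \x01 | \x5b ) \Q$tag\E = ( [^\x01]* ) /x ? $1 : q{};
}

# Sample line built with real SOH bytes (shortened from the question's data).
my $line = join "\x01",
    '2009/01/15 01:23:45:678: ASDF: [8=FIX.4.4', '35=D', '22=5', '40=2',
    '54=2', '55=ZVZZT', '59=0', '10=111', ']';

my @wanted = qw( 22 40 48 54 55 );    # tag 48 is absent from $line
my $row = join '|', map { tag_value( $line, $_ ) } @wanted;
print "$row\n";                       # 5|2||2|ZVZZT
```

Because every field is looked up independently, a missing tag leaves its own column empty and cannot shift the later values to the left.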
      This is exactly what I was looking for. You also helped me gain a deeper understanding of how the character class works in perl regular expressions.

      Thanks very much!

        Note also that \cA is an alternate representation of 'control-A' both in and out of a character class.
        >perl -wMstrict -le "my $s = qq{fee\x01fie\x01foe}; $s =~ s{ [\cA] }{--}xmsg; print $s; my $t = qq{biz\cAbaz\cAboz}; $t =~ s{ \x01 }{++}xmsg; print $t; "
        fee--fie--foe
        biz++baz++boz
Re: Log Parsing using Regex
by jethro (Monsignor) on Jul 09, 2009 at 13:22 UTC

    I'm a bit surprised that this can work. Is the SOH character directly in the log file, or is it just the string '^A'? I assume the latter, since your regex seems to try to check for that.

    I said 'try' because a ^ has a special meaning, it matches the start of a string. You have to escape ^ to match your tags, so it is very astonishing that you say your pattern works

    It is no surprise that your matches (however you get them) contain the ^A when the parentheses used to catch the value are also around the ^A.

    A solution could be something like this:

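    # Declare the hash once, outside the loop ("my $tags{$1}" is a syntax error).
    my %tags;
    while ($line =~ m/\^A(\d+)=(.*?)(?=\^A)/g) {
        $tags{$1} = $2;
    }
    # A value of 0 is legitimate, so test for definedness rather than truth.
    print defined $tags{55} ? $tags{55} : '';
    print '|';
    print defined $tags{22} ? $tags{22} : '';
    print '|';
    ...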

    I assume that you only want a few specific tags to print, otherwise you should use an array to have a defined ordering of the tags and print the tags in a loop over that array

    Note that the lookahead in the regex is necessary: without it, each match would consume the closing ^A, and the following match would step over every second tag.
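
To illustrate the point about the lookahead, here is a small sketch that runs the same global match with and without it. It assumes the separator is a real SOH byte (\x01) rather than the two-character string '^A', and the three-tag sample string is made up for the demonstration.

```perl
use strict;
use warnings;

# Three tags, each opened and closed by a real SOH byte.
my $line = "\x0122=5\x0140=2\x0148=ZVZZT.N\x01";

# Without a lookahead the closing SOH is consumed, so the next search
# starts after it and cannot use it as the opening SOH of the next tag.
my %consumed = $line =~ m/\x01(\d+)=(.*?)\x01/g;

# With a lookahead the closing SOH is only checked, not consumed.
my %peeked = $line =~ m/\x01(\d+)=(.*?)(?=\x01)/g;

my $without = join ',', sort { $a <=> $b } keys %consumed;
my $with    = join ',', sort { $a <=> $b } keys %peeked;

print "without lookahead: $without\n";    # 22,48
print "with lookahead:    $with\n";       # 22,40,48
```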

Re: Log Parsing using Regex
by johngg (Canon) on Jul 09, 2009 at 22:55 UTC

    This uses a hash a bit like jethro's solution but it is populated by split'ing the tags part of the line after the timestamp and tags string have been captured. I use the hex representation of [ and ] in the regular expression to avoid confusing escaping. I've added another line to the log with the '48=' tag missing to show the change in output.

    use strict;
    use warnings;

    my @wanted = qw{ 55 48 22 };

    while ( <DATA> )
    {
        # The sample data below shows SOH as the two characters ^A;
        # turn them into real SOH bytes (not needed for a real log).
        s{\^A}{\x01}g;

        my( $head, $tagStr ) = m{ (.*) \s+\x5b ([^\x5d]+) }x;
        print qq{Line : $head\n};

        my %tags = map { split m{=} } split m{\x01}, $tagStr;
        print qq{Tags found:\n},
            map  { sprintf qq{ %-3s => %s\n}, $_, $tags{ $_ } }
            sort { $a <=> $b } keys %tags;
        print qq{Wanted : },
            join q{|}, map { exists $tags{ $_ } ? $tags{ $_ } : q{} } @wanted;
        print qq{\n===========\n};
    }

    __END__
    2009/01/15 01:23:45:678: ASDF: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A48=ZVZZT.N^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]
    2009/01/15 01:27:09:154: QWER: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]

    The output.

    Line : 2009/01/15 01:23:45:678: ASDF:
    Tags found:
     1   => HEM
     8   => FIX.4.4
     9   => 228
     10  => 111
     11  => abcefg123456
     15  => ZZZ
     21  => 1
     22  => 5
     34  => 6
     35  => D
     38  => 100
     40  => 2
     44  => 4.80000000
     48  => ZVZZT.N
     49  => ZYXW
     50  => MEH
     52  => 20090115-01:23:45
     54  => 2
     55  => ZVZZT
     56  => MYCO
     59  => 0
     60  => 20090115-01:23:45
     100 => MEH
     116 => BLAH
     129 => HALB
    Wanted : ZVZZT|ZVZZT.N|5
    ===========
    Line : 2009/01/15 01:27:09:154: QWER:
    Tags found:
     1   => HEM
     8   => FIX.4.4
     9   => 228
     10  => 111
     11  => abcefg123456
     15  => ZZZ
     21  => 1
     22  => 5
     34  => 6
     35  => D
     38  => 100
     40  => 2
     44  => 4.80000000
     49  => ZYXW
     50  => MEH
     52  => 20090115-01:23:45
     54  => 2
     55  => ZVZZT
     56  => MYCO
     59  => 0
     60  => 20090115-01:23:45
     100 => MEH
     116 => BLAH
     129 => HALB
    Wanted : ZVZZT||5
    ===========

    I hope this is of interest.

    Cheers,

    JohnGG

      Thanks to everyone! JohnGG, your approach of splitting the line based on a regular expression proved to be the most efficient solution. The script took far less time to run and used less than half of the resources.

      Thanks again.
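
The timing reported above is the poster's own observation on their logs. For anyone who wants to check it on their own data, here is a rough sketch using the core Benchmark module. The sub names by_regex and by_split, the shortened sample line, and the three wanted tags are assumptions for illustration; the relative speed will depend on the machine, the line length, and how many tags are wanted.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @wanted = qw( 55 48 22 );

# Shortened sample line with real SOH bytes between the fields.
my $line = '2009/01/15 01:23:45:678: ASDF: ['
    . join( "\x01",
        qw( 8=FIX.4.4 9=228 35=D 49=ZYXW 22=5 38=100 40=2
            48=ZVZZT.N 54=2 55=ZVZZT 10=111 ) )
    . "\x01]";

# One regex search over the line per wanted tag.
sub by_regex {
    my ($l) = @_;
    return join '|',
        map { my $t = $_; $l =~ m/\x01$t=([^\x01]*)/ ? $1 : q{} } @wanted;
}

# One split of the tag string into a hash, then plain hash lookups.
# The limit of 2 keeps any '=' inside a value intact.
sub by_split {
    my ($l) = @_;
    my ($tag_str) = $l =~ m/\x5b([^\x5d]+)/;
    my %tags = map { split m/=/, $_, 2 } split m/\x01/, $tag_str;
    return join '|', map { exists $tags{$_} ? $tags{$_} : q{} } @wanted;
}

die "approaches disagree\n" unless by_regex($line) eq by_split($line);
print by_split($line), "\n";    # ZVZZT|ZVZZT.N|5

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex => sub { by_regex($line) },
    split => sub { by_split($line) },
} );
```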