Log Parsing using Regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am working on a script which parses a set of log files. The goal is to extract the values of certain tags in the log files with capture groups. The data I am parsing looks similar to this:

2009/01/15 01:23:45:678: ASDF: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A48=ZVZZT.N^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]

Note that ^A represents the SOH character (ASCII value 1).

My goal is to be able to capture the value of any given tag. So far, I have tried this:

if($line =~ m/^A55=(.*?^A)/){
    print "$1|";
} else {
    print "|";
}

My output is a pipe-delimited set of values. The above PERL works with the exception that each value in my output contains the "^A". I want to correct this by capturing just the value between "\d\d\d\d=" and the next "^A" (ungreedy), where every "d" is known. If no match is found (i.e. the tag is not present), I want to output just a "blank" pipe.

The second issue I would like to resolve is to simplify or generalize my statements. Currently I have a series of if statements, such as the one above, checking for every tag I'm trying to capture (e.g. 55=, 48=, 22=). I'd like to see if there is a more "clever" way to do this in a single statement. Something such as this, perhaps:

if($line =~ m/(^A22=.*?^A).*(^A40=.*?^A).*(^A48=.*?^A).*(^A54=.*?^A).*(^A55=.*?^A)/g){
    print "$1|$2|$3|$4|$5\n";
}

Please note that if one of the patterns above does not match, I'd like the corresponding $buffer variable to contain a blank rather than the next matched group value.

Thanks very much for your time and consideration,

Re: Log Parsing using Regex by moritz (Cardinal) on Jul 09, 2009 at 12:55 UTC
> The above PERL works with the exception that each value in my output contains the "^A".

The language is called "Perl"; the interpreter is called "perl". "PERL" does not exist.

Anyway, if you don't want to capture the ^A, don't put it inside the parentheses:

`if ($line =~ m/\x01 55= (.*?) \x01/x) { ... }`

You can get rid of the non-greedy quantifier by using a negated character class instead:

`if ($line =~ m/\x01 55= ([^\x01]*) /x) { ... }`
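To tie this back to the pipe-delimited output asked for in the question, here is a minimal sketch built on the character-class idea. The helper name `tag_value` and the shortened sample line are my own inventions for illustration, and the sketch assumes the log really contains SOH bytes (`\x01`) and that the first tag directly follows the opening `[`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the value of one tag,
# or an empty string when the tag is absent from the line.
sub tag_value {
    my ($line, $tag) = @_;
    # A tag starts right after SOH (\x01) or after the opening "[";
    # its value is everything up to, but not including, the next SOH.
    return $line =~ m/ [\x01\[] \Q$tag\E = ( [^\x01]* ) /x ? $1 : '';
}

# Shortened sample line (an assumption, modelled on the data in the question).
my $line = "2009/01/15 01:23:45:678: ASDF: "
         . "[8=FIX.4.4\x0148=ZVZZT.N\x0155=ZVZZT\x0159=0\x01]";

# One field per wanted tag; tag 22 is missing here, so its field stays blank.
print join('|', map { tag_value($line, $_) } qw(55 48 22)), "\n";
# prints: ZVZZT|ZVZZT.N|
```

Because the capture stops at the next SOH, the last tag's value is only clean if the message ends with an SOH before the closing `]`, as it does in the sample data.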
Re^2: Log Parsing using Regex by Anonymous Monk on Jul 09, 2009 at 13:25 UTC
This is exactly what I was looking for. You also helped me gain a deeper understanding of how character classes work in Perl regular expressions. Thanks very much!
Re^3: Log Parsing using Regex by AnomalousMonk (Archbishop) on Jul 09, 2009 at 15:33 UTC
Note also that `\cA` is an alternate representation of 'control-A', both in and out of a character class.

>perl -wMstrict -le
"my $s = qq{fee\x01fie\x01foe};
 $s =~ s{ [\cA] }{--}xmsg;
 print $s;
 my $t = qq{biz\cAbaz\cAboz};
 $t =~ s{ \x01 }{++}xmsg;
 print $t;
"
fee--fie--foe
biz++baz++boz
Re: Log Parsing using Regex by jethro (Monsignor) on Jul 09, 2009 at 13:22 UTC
I'm a bit surprised that this can work. Is the SOH character directly in the log file, or is it just the string '^A'? I assume the latter, since your regex seems to try to check for that. I say 'try' because ^ has a special meaning: it matches the start of a string. You have to escape the ^ to match your tags, so it is astonishing that you say your pattern works.

It is no surprise that your matches (however you get them) contain the ^A when the parentheses used to capture the value are also around the ^A.

A solution could be something like this:

my %tags;
while ($line =~ m/\^A(\d+)=(.*?)(?=\^A)/g) {
    $tags{$1} = $2;
}
print $tags{55} // '';   # defined-or (Perl 5.10+), so a value of 0 still prints
print '|';
print $tags{22} // '';
print '|';
...

I assume that you only want to print a few specific tags; otherwise you should use an array to give the tags a defined order and print them in a loop over that array.

Note that the lookahead in the regex is necessary: without it, each match would consume the ^A that introduces the next tag, and every second tag would be skipped.
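The array-plus-loop variant mentioned above might look like the following sketch. Unlike the snippet above, it assumes the log holds real SOH bytes (`\x01`) rather than the two-character string '^A'. The helper name `parse_tags` and the shortened sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every tag=value pair of one line into a hash.
sub parse_tags {
    my ($line) = @_;
    my %tags;
    # A pair starts right after "[" or SOH (checked with a lookbehind, so the
    # delimiter is not consumed) and its value runs up to the next SOH or "]".
    while ($line =~ m/ (?<= [\x01\[] ) (\d+) = ( [^\x01\]]* ) /xg) {
        $tags{$1} = $2;
    }
    return %tags;
}

# The array fixes the order of the output columns.
my @wanted = qw(55 48 22 59);

# Shortened sample line (an assumption, modelled on the data in the question).
my $line = "2009/01/15 01:23:45:678: ASDF: "
         . "[8=FIX.4.4\x0148=ZVZZT.N\x0155=ZVZZT\x0159=0\x01]";

my %tags = parse_tags($line);

# Missing tags become empty fields; "defined" keeps a literal 0 value intact.
print join('|', map { defined $tags{$_} ? $tags{$_} : '' } @wanted), "\n";
# prints: ZVZZT|ZVZZT.N||0
```

Each line is scanned only once, however many tags are wanted, and adding a column only means adding a tag number to `@wanted`.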
Re: Log Parsing using Regex by johngg (Canon) on Jul 09, 2009 at 22:55 UTC
This uses a hash a bit like jethro's solution, but it is populated by splitting the tags part of the line after the timestamp and tags string have been captured. I use the hex representation of [ and ] in the regular expression to avoid confusing escaping. I've added another line to the log with the '48=' tag missing to show the change in output. In the data below, ^A stands for the literal SOH character (\x01), as in the question.

use strict;
use warnings;

my @wanted = qw{ 55 48 22 };

while ( <DATA> )
{
    my( $head, $tagStr ) = m{ (.*) \s+\x5b ([^\x5d]+) }x;
    print qq{Line : $head\n};
    my %tags = map { split m{=} } split m{\x01}, $tagStr;
    print
        qq{Tags found:\n},
        map { sprintf qq{ %-3s => %s\n}, $_, $tags{ $_ } }
        sort { $a <=> $b } keys %tags;
    print
        qq{Wanted : },
        join q{|},
        map { exists $tags{ $_ } ? $tags{ $_ } : q{} } @wanted;
    print qq{\n===========\n};
}

__END__
2009/01/15 01:23:45:678: ASDF: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A48=ZVZZT.N^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]
2009/01/15 01:27:09:154: QWER: [8=FIX.4.4^A9=228^A35=D^A49=ZYXW^A56=MYCO^A34=6^A52=20090115-01:23:45^A116=BLAH^A129=HALB^A50=MEH^A1=HEM^A11=abcefg123456^A15=ZZZ^A21=1^A22=5^A38=100^A40=2^A44=4.80000000^A54=2^A55=ZVZZT^A59=0^A60=20090115-01:23:45^A100=MEH^A10=111^A]

The output.

Line : 2009/01/15 01:23:45:678: ASDF:
Tags found:
 1   => HEM
 8   => FIX.4.4
 9   => 228
 10  => 111
 11  => abcefg123456
 15  => ZZZ
 21  => 1
 22  => 5
 34  => 6
 35  => D
 38  => 100
 40  => 2
 44  => 4.80000000
 48  => ZVZZT.N
 49  => ZYXW
 50  => MEH
 52  => 20090115-01:23:45
 54  => 2
 55  => ZVZZT
 56  => MYCO
 59  => 0
 60  => 20090115-01:23:45
 100 => MEH
 116 => BLAH
 129 => HALB
Wanted : ZVZZT|ZVZZT.N|5
===========
Line : 2009/01/15 01:27:09:154: QWER:
Tags found:
 1   => HEM
 8   => FIX.4.4
 9   => 228
 10  => 111
 11  => abcefg123456
 15  => ZZZ
 21  => 1
 22  => 5
 34  => 6
 35  => D
 38  => 100
 40  => 2
 44  => 4.80000000
 49  => ZYXW
 50  => MEH
 52  => 20090115-01:23:45
 54  => 2
 55  => ZVZZT
 56  => MYCO
 59  => 0
 60  => 20090115-01:23:45
 100 => MEH
 116 => BLAH
 129 => HALB
Wanted : ZVZZT||5
===========

I hope this is of interest.

Cheers,

JohnGG
Re^2: Log Parsing using Regex by Anonymous Monk on Jul 21, 2009 at 12:24 UTC
Thanks to everyone! JohnGG - your approach of splitting the line based on a regular expression proved to be the most efficient solution. The script took far less time to run and used less than half of the resources. Thanks again.
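For anyone who wants to check a speed difference like this on their own data, one possible way is the core Benchmark module. The sketch below is only an illustration: the sub names `by_regex` and `by_split`, the shortened sample line, and the iteration count are assumptions, not the code that was actually timed. The outcome will depend on line length, the number of wanted tags, and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @wanted = qw(55 48 22);

# Shortened sample line with real SOH separators (an assumption).
my $line = "2009/01/15 01:23:45:678: ASDF: ["
         . join("\x01", qw(8=FIX.4.4 35=D 22=5 48=ZVZZT.N 54=2 55=ZVZZT 10=111))
         . "\x01]";

# Approach 1: one regex match per wanted tag.
sub by_regex {
    my ($l) = @_;
    my @out;
    for my $tag (@wanted) {
        push @out, $l =~ m/[\x01\[]\Q$tag\E=([^\x01]*)/ ? $1 : '';
    }
    return join '|', @out;
}

# Approach 2: capture the bracketed part once, split it into a hash.
sub by_split {
    my ($l) = @_;
    my ($tag_str) = $l =~ m/\[([^\]]+)/;
    return join '|', ('') x @wanted unless defined $tag_str;
    my %tags = map { split /=/, $_, 2 } split /\x01/, $tag_str;
    return join '|', map { exists $tags{$_} ? $tags{$_} : '' } @wanted;
}

# Sanity check: both approaches must agree before timing them.
die "results differ" unless by_regex($line) eq by_split($line);

# Run each sub 100_000 times and print a comparison table.
cmpthese(100_000, {
    regex_per_tag => sub { by_regex($line) },
    split_to_hash => sub { by_split($line) },
});
```

With only a handful of wanted tags the per-tag regex may well hold its own; the split approach should gain as the list of wanted tags grows, since the line is only walked once.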