in reply to Multi-Format Log Parser - Version 2.0

Neat stuff. ++ for using \"([^\"]*)\" instead of \"(.*?)\" that is all too often seen.

Bear in mind though, that strange User-Agent strings can break your regexp. Specifically, I once encountered "Slurp 1.0" (literally, with the quotes) as a user agent in my log file.

This was a real bugger to work around. I suppose a sufficiently well-crafted regexp could extract foo from "foo" as well as bar from ""bar"". I solved the problem in two steps: I matched the leading fields, then the trailing fields, and whatever was left was the user agent field. Keep in mind that ""user "foo" bar" could appear as a user agent. It gets icky.
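The two-step approach described above might look roughly like the sketch below. It is not the code I actually used back then: the helper name parse_extended and the sample log line are made up for illustration, and it assumes the 'extended' layout from the parent node (h l u t r c b R, then the agent, then two trailing numeric fields). It still fails if the request or referer itself contains a stray quote.

```perl
use strict;
use warnings;

# Hypothetical helper: peel known fields off both ends of the line and
# treat whatever remains in the middle as the user agent.
sub parse_extended {
    my ($line) = @_;
    chomp $line;

    # Step 1: leading fields, up to and including the quoted referer.
    my @head = $line =~ m{^(\S+) (\S+) (\S+) \[([^\]]*)\] "([^"]*)" (\d+) ([\d\-]+) "([^"]*)" }
        or return;
    my $rest = substr $line, $+[0];    # $+[0] is the end offset of the match

    # Step 2: the two trailing numeric fields.
    my @tail = $rest =~ m{ (\d+) (\d+)\z} or return;
    my $agent = substr $rest, 0, $-[0];    # everything before the tail match

    # Step 3: only the outermost quote pair is a delimiter; inner quotes stay.
    $agent =~ s{^"(.*)"\z}{$1}s or return;

    return (@head, $agent, @tail);
}

# Made-up sample line with quotes inside the user agent field.
my $line = q{10.0.0.1 - - [18/Jan/2002:23:05:00 +0000] "GET / HTTP/1.0" 200 1234 "-" ""user "foo" bar" 80 3};
my @f = parse_extended($line);
print "agent: $f[8]\n" if @f;
```

Because the agent is defined as "whatever is left", it does not matter how many quotes it contains, as long as the fields on either side of it are well-behaved.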

--
g r i n d e r
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u';

Re: Re: Multi-Format Log Parser - Version 2.0
by cjensen (Sexton) on Jan 18, 2002 at 23:05 UTC
    You're right about the improper use of quotes within a user agent string. That could cause pattern matches to fail, and those would be skipped. I'm thinking about adding an option to print log lines that don't match the currently selected format to STDERR, or a count of lines that didn't match. From using this on a fairly large web site, I know the patterns match our traffic fairly well, but it will be interesting to see how many lines don't match and why. I did a dump of counts per unique user agent string using this log parser a few days ago for our QA department and in one day's worth of logs there were 82,279 unique user agent strings. Our QA guys are after percentages of traffic per browser and platform, and I don't relish their job of parsing all the user agent strings to get that information since they don't follow any standardized format.
      I implemented a quick debug option that spits non-matches out to STDERR. In testing I found a pattern bug with the byte count of 304 log entries: they log "-" instead of a number, which (\d+) doesn't match. Both are fixed in the following diff:

      26c26
      < GetOptions (\%optctl, "type|t=s", "pattern|p=s");
      ---
      > GetOptions (\%optctl, "type|t=s", "pattern|p=s", "debug|d=i");
      30,32c30,32
      < 'common' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b)] ],
      < 'virtual' => [ qr{(\S+) (\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+)}, [qw(v h l u t r c b)] ],
      < 'combined' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\"}, [qw(h l u t r c b R A)] ],
      ---
      > 'common' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)}, [qw(h l u t r c b)] ],
      > 'virtual' => [ qr{(\S+) (\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)}, [qw(v h l u t r c b)] ],
      > 'combined' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\"}, [qw(h l u t r c b R A)] ],
      35,36c35,36
      < 'extended' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b R A P T)] ],
      < 'custom' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) (\d+) \"([^\"]*)\" \"([^\"]*)\" (\d+)}, [qw(h l u t r c b A R T)] ],
      ---
      > 'extended' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\" (\d+) (\d+)}, [qw(h l u t r c b R A P T)] ],
      > 'custom' => [ qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+) \"([^\"]*)\" \"([^\"]*)\" (\d+)}, [qw(h l u t r c b A R T)] ],
      102a103,104
      > } elsif ($optctl{debug} == 1) {
      > print STDERR $_;

      With the new patterns, a quick run against 79,154 lines from an access log in 'extended' format left 8 lines that didn't match. All of them failed because of quotes in the request or the user agent string.
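A count like the one above can be collected with a few lines of Perl. The following is only a minimal sketch, not part of the parser itself: the helper name tally_matches and the three sample lines are invented for illustration, and the pattern is the patched 'common' one from the diff.

```perl
use strict;
use warnings;

# Hypothetical helper: count matching and non-matching lines for a pattern,
# echoing the non-matching ones to STDERR (as the debug option does).
sub tally_matches {
    my ($pattern, @lines) = @_;
    my ($matched, $unmatched) = (0, 0);
    for my $line (@lines) {
        if ($line =~ $pattern) {
            $matched++;
        } else {
            $unmatched++;
            print STDERR $line;
        }
    }
    return ($matched, $unmatched);
}

# The patched 'common' pattern, which accepts "-" as a byte count.
my $common = qr{(\S+) (\S+) (\S+) \[([^\]]*)\] \"([^\"]*)\" (\d+) ([\d\-]+)};

# Made-up sample lines: a normal hit, a 304 with "-" bytes,
# and a request containing a stray quote.
my @sample = map { "$_\n" } (
    q{1.2.3.4 - - [18/Jan/2002:23:05:00 +0000] "GET / HTTP/1.0" 200 1234},
    q{1.2.3.4 - - [18/Jan/2002:23:05:01 +0000] "GET /x HTTP/1.0" 304 -},
    q{1.2.3.4 - - [18/Jan/2002:23:05:02 +0000] "GET /a"b HTTP/1.0" 200 10},
);

my ($ok, $bad) = tally_matches($common, @sample);
printf "%d of %d lines did not match\n", $bad, $ok + $bad;
```

Only the third sample line should land on STDERR, since the stray quote ends the request field early.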

      Here's a user agent that didn't match...
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; <HTML><A%20HREF="http://www.pghconnect.com/">www.pghconnect.com</a></HTML>)"