Somewhat basic but long, practical RE problem

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I just finished writing a regular expression that parses an Apache log file in the "Transfer Log" format. This format has 9 fields per line. If one of the fields is empty, Apache puts a hyphen, "-" or -, in its place. Here are some sample Apache log file entries:

208.168.76.195 - - [27/Jun/2001:08:04:53 -0400] "GET /core.css HTTP/1.1" 304 - "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
217.50.206.61 - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:19:55 -0400] "GET /core.css HTTP/1.1" 200 146 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:20:00 -0400] "GET /images/morningalttext.gif HTTP/1.1" 200 1424 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:20:00 -0400] "GET /images/logosun.jpg HTTP/1.1" 200 32232 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"

This should be enough info for you to write an RE to parse each of the 9 fields. The speed of the RE is very important because the file can grow very large. If you are a beginner, try your hand and get some practice. If you are an expert, please feel free to show me and others how you would approach this problem. I'm curious to see if/how you do it differently. Here's how I did it (highlight the grey box on the following page to view):

    m/^([^\s]+)     # 1st field
    \s
    ([^\s]+)        # 2nd
    \s
    ([^\s]+)        # 3rd
    \s
    \[([^\]]+)\]    # 4th
    \s
    "([^"]+)"        # 5th
    \s
    ([^\s]+)        # etc.
    \s
    ([^\s]+)
    \s
    ([^\s]+)
    \s
    "([^"]+)"
    /x;

or in regular format:

m/^([^\s]+)\s([^\s]+)\s([^\s]+)\s\[([^\]]+)\]\s"([^"]+)"\s([^\s]+)\s([^\s]+)\s([^\s]+)\s"([^"]+)"/;

Note: if you have any doubts, the RE above was tested on an 18725-line file and successfully parsed all lines.
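As a quick check of the pattern above, here is a minimal sketch that runs the one-line version against the first sample entry and prints the nine captures. The field names (`host`, `ident`, and so on) are labels picked for this example only; they do not come from the post.

```perl
use strict;
use warnings;

# One entry from the sample log above.
my $line = '208.168.76.195 - - [27/Jun/2001:08:04:53 -0400] "GET /core.css HTTP/1.1" 304 - "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"';

# The one-line pattern from the post, precompiled once with qr//.
my $re = qr/^([^\s]+)\s([^\s]+)\s([^\s]+)\s\[([^\]]+)\]\s"([^"]+)"\s([^\s]+)\s([^\s]+)\s([^\s]+)\s"([^"]+)"/;

# Hypothetical field names, chosen here for readability only.
my @names = qw(host ident user time request status bytes referer agent);

my %entry;
if (my @fields = $line =~ $re) {
    # Pair each name with the capture in the same position.
    @entry{@names} = @fields;
    printf "%-8s %s\n", $_, $entry{$_} for @names;
}
else {
    warn "no match: $line\n";
}
```

Note that the eighth capture keeps its surrounding double quotes, because that field is matched with `[^\s]+` rather than with a quoted sub-pattern.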

I know there is a module that parses Apache files. I also know this is a somewhat basic RE, but it still proved fairly difficult for someone who has only written a few long REs, so I think this can be pretty instructive for me and other new guys to Perl. Perhaps the one considered the best should be added to the Snippets section if it's not already there.

$PM = "Perl Monk's";
$MCF = "Most Clueless ~~Friar~~ Abbot";
$nysus = $PM . $MCF;

Re: Somewhat basic but long, practical RE problem by bikeNomad (Priest) on Jul 02, 2001 at 05:43 UTC
Just a stylistic comment: you can use `\S` wherever you use the equivalent `[^\s]`, giving you a somewhat less cluttered-looking RE. And you don't need the backslash before the right square bracket outside the character class.
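A small sketch of both points, using a shortened copy of one sample entry: the two patterns below should capture the same values, the second one written with `\S` and a bare `]` outside the character class. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $line = '217.50.206.61 - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180';

# Original style: negated class and an escaped closing bracket.
my ($host_a, $time_a) = $line =~ /^([^\s]+)\s[^\s]+\s[^\s]+\s\[([^\]]+)\]/;

# Suggested style: \S, and an unescaped ] outside the character class.
my ($host_b, $time_b) = $line =~ /^(\S+)\s\S+\s\S+\s\[([^\]]+)]/;

print "same host\n" if $host_a eq $host_b;
print "same time\n" if $time_a eq $time_b;
```

The backslash inside `[^\]]` is still needed, since there the `]` would otherwise close the class.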
Re: Somewhat basic but long, practical RE problem by Beatnik (Parson) on Jul 02, 2001 at 10:33 UTC
The Perl Cookbook has

    while (<LOGFILE>) {
        my ($client, $identuser, $authuser, $date, $time, $tz,
            $method, $url, $protocol, $status, $bytes) =
            /^(\S+) (\S+) (\S+) \[([^: ]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)$/;
        #...
    }

on page 726 (20.12 Parsing a Web Server Log File). Also check 20.13 (Processing Server Logs) for a snippet with format, or check Logfile::Apache. Cookbook examples are online here.

Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Re: Re: Somewhat basic but long, practical RE problem by nysus (Parson) on Jul 02, 2001 at 10:40 UTC
I would have never thought to put a plain old space character in a regular expression. I've just never seen it done before. Interesting, and thanks.

Update: I should be putting these kinds of comments in the chatterbox, no?
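For anyone else who has not seen a literal space in a pattern, here is a minimal sketch of how it behaves compared with `\s`, and why it needs care under `/x`. The test string is just the status and byte count from one sample entry.

```perl
use strict;
use warnings;

my $text = "200 2180";

# A literal space in the pattern matches exactly one space character.
my $plain = $text =~ /^(\d+) (\d+)$/ ? 1 : 0;

# Under /x, unescaped whitespace in the pattern is ignored, so this
# pattern no longer asks for anything between the two numbers and fails.
my $ignored = $text =~ /^(\d+) (\d+)$/x ? 1 : 0;

# With /x, write the space as [ ] (or escape it) to keep it significant.
my $bracketed = $text =~ /^(\d+) [ ] (\d+)$/x ? 1 : 0;

# \s is broader than a single space: it also accepts a tab, for example.
my $tabbed = "200\t2180" =~ /^(\d+)\s(\d+)$/ ? 1 : 0;

print "plain=$plain ignored=$ignored bracketed=$bracketed tabbed=$tabbed\n";
```

A literal space is the tighter choice when the log is known to use single spaces as delimiters.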
Re: Somewhat basic but long, practical RE problem by lestrrat (Deacon) on Jul 02, 2001 at 10:59 UTC
I hope you don't take this in a bad way, but I think you wouldn't get many comments if the goal of this problem is simply to parse the Apache log.

Let me put it this way: a "well formed" log file is really uninteresting to parse, since you don't expect bad entries in it. You know that every line (sans comment lines, if there are any) is going to be in an identical format. With that as a given, you don't really need to dig deep into the RE to come up with something that matches the lines.

Why not make the problem: "Let's come up with an RE that matches anything conforming to the Apache log from any given text file". And each field must look like a valid entry for a log: for example, for IP addresses, you can't match just any string because it comes at the beginning of the line and is made of non-whitespace characters. It must look like an IP address. I think then you'll find people who may be interested in coming up with a new, improved RE.

Here's something I came up with off the top of my head:

DISCLAIMER: the following code has NOT been tested, I don't even claim to know the Apache log or the HTTP RFC all that well... bottom line, it's probably not a correct RE. Just an example.

    # IP address ( sort of ) -- too lazy to come up with
    # a more elaborate RE.... I'm sure somebody knows of
    # a real RE for this :-)
    # assuming it's never empty
    (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)

    # HTTP code
    # assuming it's never empty
    \d\d\d

    # bytes ( "-" if none )
    (?:\d+|-)

    # date. assuming it's never empty
    \[
    (?:[12][0-9]|3[01]|[1-9])   # date 10-19,20-29,30,31,1-9
    /
    (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec) # month
    /
    \d\d\d\d  # year
    :
    (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9]) # time
    \]

    # HTTP Request -- don't know what the RFC says about this,
    # so will stick with a simple one.
    # assuming it's never empty
    "[^"]+"

    # Referrer
    "(?:[^"]+|-)"

    # I don't know what the second and third field does.
    (?:\S+|-)

So putting that together...

    m/
      ^                                           # beginning of line
      (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)  # pseudo IP address
      [ ]                                         # delimiter
      (?:\S+|-)[ ](?:\S+|-)                       # second and third fields...
      [ ]                                         # delimiter
      \[
      (?:[12][0-9]|3[01]|[1-9])   # date 10-19,20-29,30,31,1-9
      \/
      (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec) # month
      \/
      \d\d\d\d  # year
      :
      (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9]) # time
      \]
      [ ]                                         # delimiter
      \d\d\d                                      # HTTP code
      [ ]                                         # delimiter
      (?:\d+|-)                                   # bytes
      [ ]                                         # delimiter
      "[^"]+"                                     # HTTP request
      [ ]                                         # delimiter
      "(?:[^"]+|-)"                               # referer
      $                                           # end
    /x;

As I said, just a thought. I'm sure there are a bunch of things wrong with this RE. Feel free to point out any problems....
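In the spirit of "feel free to point out any problems", here is one possible sketch of a stricter validating pattern, checked only against the sample entries in the original post. It is not the RE above: it assumes a two-digit day, a timezone offset after the time, the request before the status code, and a trailing user-agent field, all taken from the sample log. The variable names (`$octet`, `$quoted`, and so on) are invented for this example, and the IP check is still only approximate.

```perl
use strict;
use warnings;

# Building blocks; all names here are illustrative, not from the post.
my $octet  = qr/\d{1,3}/;                      # loose: accepts 0-999
my $ip     = qr/$octet\.$octet\.$octet\.$octet/;
my $day    = qr/(?:0[1-9]|[12][0-9]|3[01])/;   # two-digit day, as in the sample
my $month  = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/;
my $clock  = qr/(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]/;
my $zone   = qr/[+-]\d{4}/;
my $quoted = qr/"[^"]*"/;                      # assumes no embedded quotes

my $entry = qr{
    ^ $ip                                  # client address
    [ ] \S+ [ ] \S+                        # second and third fields
    [ ] \[ $day / $month / \d{4}           # date
    : $clock                               # time of day
    [ ] $zone \]                           # timezone offset
    [ ] $quoted                            # request
    [ ] \d{3}                              # status code
    [ ] (?:\d+|-)                          # bytes, or "-" if none
    [ ] $quoted                            # referer
    [ ] $quoted                            # user agent
    \z
}x;

my @lines = (
    '217.50.206.61 - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"',
    'somehost - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0"',
    '217.50.206.61 - - [32/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0"',
);

my @verdicts;
for my $line (@lines) {
    push @verdicts, $line =~ $entry ? 'ok' : 'rejected';
    print "$verdicts[-1]: $line\n";
}
```

The second and third test lines are made-up bad entries (a non-numeric host and day 32) to show the pattern rejecting input that the purely positional RE would accept.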
Re: Somewhat basic but long, practical RE problem by nysus (Parson) on Jul 02, 2001 at 13:08 UTC
Important note to above "gray box" code.

Well, I discovered a problem with the code. It appears that some double-quoted fields may have double quotes within them. Given that, the RE in the gray box above misses a few entries. So, taking the above comments into consideration, the following should work better: `m/^([\S]+) ([\S]+) ([\S]+) \[([^\]]+)] "(.+?)" ([\S]+) ([\S]+) "(.+?)" "([^"]+)"$/`

I'm learning that this RE business is a lot messier than it is in the books.
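To illustrate the failure, here is a small sketch with a made-up entry whose request field contains embedded double quotes. The log line is hypothetical; the two patterns are the original one-line RE and the revised one.

```perl
use strict;
use warnings;

# Hypothetical entry: the request contains embedded double quotes.
my $line = '10.0.0.1 - - [27/Jun/2001:08:04:53 -0400] "GET /search?q="perl" HTTP/1.1" 200 146 "-" "Mozilla/4.0"';

# Original pattern: a quoted field may not contain a double quote.
my $original = qr/^([^\s]+)\s([^\s]+)\s([^\s]+)\s\[([^\]]+)\]\s"([^"]+)"\s([^\s]+)\s([^\s]+)\s([^\s]+)\s"([^"]+)"/;

# Revised pattern: non-greedy .+? together with the $ anchor lets
# backtracking extend a field past any embedded quotes.
my $revised = qr/^([\S]+) ([\S]+) ([\S]+) \[([^\]]+)] "(.+?)" ([\S]+) ([\S]+) "(.+?)" "([^"]+)"$/;

my $original_ok = $line =~ $original ? 1 : 0;
my @fields      = $line =~ $revised;

print "original matches: $original_ok\n";
print "revised request:  $fields[4]\n" if @fields;
```

The trade-off is extra backtracking on lines like this one, which may matter given the speed concern in the original post.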
Re: Somewhat basic but long, practical RE problem by mischief (Hermit) on Jul 02, 2001 at 14:01 UTC
One way to do it might be to modify apache's config to use `\t` as a delimiter instead of a space, so then you could just use `split`.
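A minimal sketch of the `split` approach, assuming the server has already been configured to write the same nine fields separated by tabs. The entry below is built by hand for illustration and is not real tab-delimited Apache output.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited entry; the field order mirrors the sample log.
my $line = join "\t",
    '217.50.206.61', '-', '-', '[27/Jun/2001:08:19:54 -0400]',
    '"GET / HTTP/1.1"', '200', '2180', '"-"',
    '"Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"';

# The -1 limit keeps trailing empty fields instead of dropping them.
my @fields = split /\t/, $line, -1;

# Strip the leading and trailing bracket or quote from each field,
# which the regex versions removed by capturing inside them.
s/^[\["]|[\]"]$//g for @fields;

print scalar(@fields), " fields\n";
print "request: $fields[4]\n";
```

Because spaces inside the date, request, and user-agent fields are no longer delimiters, no backtracking is involved at all.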
