I hope you don't take this the wrong way, but I think you won't get many comments if the goal of this problem is simply to parse the Apache log.
Let me put it this way: a "well formed" log file is really uninteresting to parse, since you don't expect bad entries in it. You know that every line (sans comment lines, if there are any) is going to be in an identical format. With that as a given, you don't really need to dig deep into the RE to come up with something that matches the lines.
Why not make the problem: "Let's come up with an RE that matches anything conforming to the Apache log format in any given text file"? Each field must then look like a valid entry for a log. For example, for IP addresses, you can't match just any string simply because it comes at the beginning of the line and consists of non-whitespace characters. It must look like an IP address.
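To make "it must look like an IP address" a bit more concrete, here is a minimal sketch of a stricter check. It only covers dotted-quad IPv4, and the names `$octet` and `$ipv4` are my own, made up for illustration:

```perl
use strict;
use warnings;

# One octet in the range 0-255, so values such as 999 are rejected.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/;

# Four octets separated by dots. The lookarounds keep the pattern from
# matching in the middle of a longer run of digits and dots.
my $ipv4 = qr/(?<![\d.])$octet(?:\.$octet){3}(?![\d.])/;

for my $candidate ('192.168.0.1', '999.1.1.1', '10.0.0.256') {
    printf "%-12s %s\n", $candidate,
        $candidate =~ /^$ipv4$/ ? 'looks like an IP' : 'rejected';
}
```

Only the first candidate is accepted. Whether you want to be this strict depends on what may show up in the first field; a log that records hostnames instead of addresses would need a different pattern.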
I think then you'll find people that may be interested in coming up with a new, improved RE.
Here's something I came up with off the top of my head:

    # IP address ( sort of ) -- too lazy to come up with
    # a more elaborate RE.... I'm sure somebody knows of
    # a *real* RE for this :-)
    # assuming it's never empty
    (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)

    # HTTP code
    # assuming it's never empty
    \d\d\d

    # bytes ( "-" if none )
    (?:\d+|-)

    # date. assuming it's never empty
    \[
    (?:[12][0-9]|3[01]|[1-9])   # date 10-19,20-29,30,31,1-9
    /
    (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec) # month
    /
    \d\d\d\d  # year
    :
    (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9]) # time
    \]

    # HTTP Request -- don't know what the RFC says about this,
    # so will stick with a simple one.
    # assuming it's never empty
    "[^"]+"

    # Referrer
    "(?:[^"]+|-)"

    # I don't know what the second and third fields do.
    (?:\S+|-)
So putting that together...
    m{
        ^                                     # beginning of line
        (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)  # pseudo IP address
        [ ]                                   # delimiter
        (?:\S+|-)[ ](?:\S+|-)                 # second and third fields...
        [ ]                                   # delimiter
        \[
            (?:[12][0-9]|3[01]|[1-9])         # date 10-19,20-29,30,31,1-9
            /
            (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec) # month
            /
            \d\d\d\d                          # year
            :
            (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9]) # time
        \]
        [ ]                                   # delimiter
        \d\d\d                                # HTTP code
        [ ]                                   # delimiter
        (?:\d+|-)                             # bytes
        [ ]                                   # delimiter
        "[^"]+"                               # HTTP request
        [ ]                                   # delimiter
        "(?:[^"]+|-)"                         # referer
        $                                     # end
    }x;
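For what it's worth, here is a rough sketch of how such a pattern could be used to pick log-like lines out of arbitrary text. It repeats the pattern (with `Dec` in the month list and `{}` delimiters so the literal `/` in the date needs no escaping). The sample lines and the names `$log_re`, `@lines` and `@hits` are made up for illustration, and the lines follow the field order this pattern assumes (no timezone, status and bytes before the request), which is not necessarily what your server writes:

```perl
use strict;
use warnings;

# The pattern compiled once with qr//; field order kept as in this post.
my $log_re = qr{
    ^
    (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)   # pseudo IP address
    [ ]
    (?:\S+|-)[ ](?:\S+|-)                        # second and third fields
    [ ]
    \[
        (?:[12][0-9]|3[01]|[1-9])                # day of month
        /
        (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec)
        /
        \d\d\d\d                                 # year
        :
        (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9])
    \]
    [ ]
    \d\d\d                                       # HTTP code
    [ ]
    (?:\d+|-)                                    # bytes
    [ ]
    "[^"]+"                                      # HTTP request
    [ ]
    "(?:[^"]+|-)"                                # referer
    $
}x;

# Hypothetical input: one conforming line, one comment, one bad month.
my @lines = (
    '192.168.0.1 - - [12/Oct/2001:13:55:36] 200 2326 "GET /index.html HTTP/1.0" "-"',
    '# a comment line that should be skipped',
    '192.168.0.1 - - [12/Dev/2001:13:55:36] 200 2326 "GET / HTTP/1.0" "-"',
);

# Keep only the lines that conform to the pattern.
my @hits = grep { $_ =~ $log_re } @lines;
print "$_\n" for @hits;
```

Only the first sample line gets through. The comment line fails at the IP address, and the third line fails on the month name.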
As I said, just a thought. I'm sure there are a bunch of things wrong with this RE. Feel free to point out any problems....
In reply to Re: Somewhat basic but long, practical RE problem
by lestrrat
in thread Somewhat basic but long, practical RE problem
by nysus