I hope you don't take this the wrong way, but I don't think you'll get many comments if the goal of this problem is simply to parse the Apache log.

Let me put it this way: a "well formed" log file is really uninteresting to parse, since you don't expect bad entries in it. You know that every line (sans comment lines, if there are any) is going to be in an identical format. With that as a given, you don't really need to dig deep into the RE to come up with something that matches the lines.

Why not make the problem: "Let's come up with an RE that matches anything conforming to the Apache log format in any given text file"? Each field must then look like a valid log entry. For IP addresses, for example, you can't match just any string because it sits at the beginning of the line and consists of non-whitespace characters. It must look like an IP address.
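To make "it must look like an IP address" a bit more concrete, here is a minimal sketch of a stricter IPv4 pattern. The names $octet and $ipv4 are made up for illustration, and "valid" is assumed to mean four decimal octets in the range 0-255 with no leading zeros. It's only one possible reading of the requirement, not a definitive IP RE.

```perl
use strict;
use warnings;

# One octet, 0-255, no leading zeros (an assumption about what "valid" means).
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])/;

# Four dot-separated octets. The lookarounds stop the pattern from matching
# inside a longer run of digits and dots when scanning arbitrary text.
my $ipv4 = qr/(?<![\d.])$octet(?:\.$octet){3}(?![\d.])/;

for my $candidate ('192.168.0.1', '999.1.1.1', '10.0.0.256', 'foo') {
    printf "%-12s %s\n", $candidate,
        $candidate =~ /^$ipv4$/ ? 'looks like an IP' : 'rejected';
}
```

Unlike a bare \S+, this refuses things like 999.1.1.1, at the cost of a longer pattern.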

I think you'll then find people who may be interested in coming up with a new, improved RE.

Here's something I came up with off the top of my head:

DISCLAIMER: the following code has NOT been tested, and I don't even claim to know the Apache log or the HTTP RFC all that well... bottom line, it's probably not a correct RE. Just an example.
    # IP address ( sort of ) -- too lazy to come up with
    # a more elaborate RE.... I'm sure somebody knows of
    # a *real* RE for this :-)
    # assuming it's never empty
    (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)

    # HTTP code
    # assuming it's never empty
    \d\d\d

    # bytes ( "-" if none )
    (?:\d+|-)

    # date. assuming it's never empty
    \[
    (?:[12][0-9]|3[01]|[1-9])                                   # date 10-19,20-29,30,31,1-9
    /
    (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec)  # month
    /
    \d\d\d\d                                                    # year
    :
    (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9])          # time
    \]

    # HTTP Request -- don't know what the RFC says about this,
    # so will stick with a simple one.
    # assuming it's never empty
    "[^"]+"

    # Referrer
    "(?:[^"]+|-)"

    # I don't know what the second and third fields do.
    (?:\S+|-)

So putting that together...

    m{
        ^                                           # beginning of line
        (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)  # pseudo IP address
        [ ]                                         # delimiter
        (?:\S+|-)[ ](?:\S+|-)                       # second and third fields...
        [ ]                                         # delimiter
        \[
            (?:[12][0-9]|3[01]|[1-9])               # date 10-19,20-29,30,31,1-9
            /
            (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec)  # month
            /
            \d\d\d\d                                # year
            :
            (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9])          # time
        \]
        [ ]                                         # delimiter
        \d\d\d                                      # HTTP code
        [ ]                                         # delimiter
        (?:\d+|-)                                   # bytes
        [ ]                                         # delimiter
        "[^"]+"                                     # HTTP request
        [ ]                                         # delimiter
        "(?:[^"]+|-)"                               # referer
        $                                           # end
    }x;

As I said, just a thought. I'm sure there are a bunch of things wrong with this RE. Feel free to point out any problems...
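One quick way to look for problems is to compile the RE and throw a few lines at it. The sketch below is only that: $log_re and @samples are names I made up, the first two sample lines are invented to fit the field order the RE expects, and the third follows the layout shown in the Apache docs for the common log format (zero-padded day, timezone offset, request before status), as far as I remember it.

```perl
use strict;
use warnings;

# The RE from above, compiled once with qr{...}x. Brace delimiters mean the
# literal slashes in the date need no escaping.
my $log_re = qr{
    ^
    (?:\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d)   # pseudo IP address
    [ ]
    (?:\S+|-)[ ](?:\S+|-)                        # second and third fields
    [ ]
    \[
        (?:[12][0-9]|3[01]|[1-9])                # day of month
        /
        (?:J(?:un|ul|an)|Feb|Ma(?:r|y)|A(?:pr|ug)|Sep|Oct|Nov|Dec)
        /
        \d\d\d\d                                 # year
        :
        (?:[01][0-9]|2[0-3]):(?:[0-5][0-9]):(?:[0-5][0-9])
    \]
    [ ]
    \d\d\d                                       # HTTP code
    [ ]
    (?:\d+|-)                                    # bytes
    [ ]
    "[^"]+"                                      # HTTP request
    [ ]
    "(?:[^"]+|-)"                                # referer
    $
}x;

my @samples = (
    # invented line in the field order the RE expects
    '127.0.0.1 - - [12/Dec/2001:13:55:36] 200 2326 "GET /index.html HTTP/1.0" "-"',
    # same line with an impossible hour
    '127.0.0.1 - - [12/Dec/2001:25:55:36] 200 2326 "GET /index.html HTTP/1.0" "-"',
    # common log format layout: timezone offset, request before status
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
);

for my $line (@samples) {
    print +($line =~ $log_re ? 'match   ' : 'no match'), "  $line\n";
}
```

The third line doesn't match, which already points at two gaps: the RE has no place for a timezone offset, and it expects the status code and byte count before the request rather than after it.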


In reply to Re: Somewhat basic but long, practical RE problem by lestrrat
in thread Somewhat basic but long, practical RE problem by nysus
