arturo has asked for the wisdom of the Perl Monks concerning the following question:

Parsing an Apache logfile. Yes, I know that there is such a thing as Apache::ParseLog but since it's OO, I have efficiency worries and anyhow, the real point of this question is for me to hone some regex skills if possible. The problem: my current method looks like this:

    while (<LOGFILE>) {
        my $hit = parse_line($_);
        # do a bunch of stuff with the hashref, like
        # insert it into a DB.
    }

    sub parse_line {
        my $line = shift;
        if ($line =~ /^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/) {
            return {host_ip=>$1, timestamp=>$2, vhost=>$3, request=>$4, HTTP_CODE=>$5 };
        }
        # and some stuff to handle errors, that needn't bother us
    }

The problem is that (l)users are sometimes entering something silly into their location windows, such as
"http://www.foo.edu"
(that's right, quotes and all; which, as you can see, messes with my regex, specifically, the parens associated with the fourth and fifth match variables.)

Now I *can* throw them away. These files take hours to process (the DB inserts take a while), and this sub processes over 2 million lines a run, so I'd like to keep it as lean as possible. Is it worth my while to handle the silly URLs, or should I just forget about them? How expensive, in terms of processing time, would it be to craft a smarter regex?

Yes, I will deal with that dot-star in there ...

Sample goofy line from log:

    1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925

i.e. ipaddress, username, realm, timestamp, virtual host name, "request string", http code
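
To make the failure concrete, here is a minimal sketch that runs the pattern from parse_line against the sample line. The names `$re`, `$goofy`, `$normal` and `matches` are made up for illustration, and `$normal` is an assumed well-formed counterpart of the sample, not a line taken from the real log.

```perl
use strict;
use warnings;

# Pattern copied from parse_line in the question.
my $re = qr/^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/;

# The goofy line from the log, plus an assumed well-formed line for comparison.
my $goofy  = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925';
my $normal = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET / HTTP/1.1" 200 2925';

# Returns 1 if the pattern matches the line, 0 otherwise.
sub matches {
    my ($line) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $line ($normal, $goofy) {
    print matches($line) ? "parsed:  " : "skipped: ", $line, "\n";
}
```

The goofy line is skipped because `[^"]+` stops at the stray quote right after `GET /`, and that quote is followed by `h` rather than by a space and the status code.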

Philosophy can be made out of anything -- or less

Re: Crafting a regex
by merlyn (Sage) on Oct 11, 2000 at 21:12 UTC
    Yes, I know that there is such a thing as Apache::ParseLog but since it's OO, I have efficiency worries
    The overhead to call a method is about twice that of a normal subroutine call, if I recall "The Damian"'s benchmarks on it. You are optimizing a lot of stuff that could better be served by using a module. Please stop reinventing an inefficient wheel.

    At a minimum, look inside the module source (it's free!) to see how they solved it.

    -- Randal L. Schwartz, Perl hacker
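
The call-overhead claim is easy to measure rather than recall. The following is a rough sketch using the core Benchmark module; the `Counter` package and its `bump` sub are hypothetical stand-ins, not anything from Apache::ParseLog, and the numbers will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical toy class: the same sub is called as a function and as a method.
package Counter;
sub new  { return bless { n => 0 }, shift }
sub bump { $_[0]{n}++ }

package main;

my $obj = Counter->new;

# A negative count tells Benchmark to run each variant for at least
# that many CPU seconds instead of a fixed number of iterations.
cmpthese(-1, {
    function => sub { Counter::bump($obj) },   # plain subroutine call
    method   => sub { $obj->bump },            # method call via the object
});
```

Whatever ratio this prints, it is worth comparing against the time spent on the DB inserts before deciding that the call overhead matters.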

      Interesting suggestion. Poking around inside the innards of the module, I find

      my($url_rx) = '(\\S+)'; # %U (url, URL)

      Which isn't going to solve the problem; not that I would expect it to, actually. "the problem with making anything so foolproof is that fools are so (consarned) ingenious!"

      That's it, then, those lines are going on the junkheap =)

      Philosophy can be made out of anything -- or less

Re: Crafting a regex
by ahunter (Monk) on Oct 11, 2000 at 21:29 UTC
    Perl is fairly good at compiling regexps, and provided you only use the constructs that can be simulated with a 'standard' regexp (which is to say, provided you can construct a deterministic finite automaton from the expression), the running time should be O(n), where n is the length of the string being matched against.

    This means you should avoid using anything that might cause perl to backtrack or look ahead; mostly that means the (?...) operators.

    Anyhow, from your example, it looks like the closing quote is always the last one, so you could just take advantage of perl's greediness and use the following regexp, which should run at exactly the same speed as your original:

    /^(\S+).*?\[(\S+).*?] (\S+) "(.+)" (\d+)/
    Andrew.
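
As a quick check, the greedy pattern can be dropped into the parse_line from the question and run against the sample line. This is a sketch only: it assumes, as in the sample, that no quoted field follows the status code, and `$goofy` and `$hit` are illustrative names.

```perl
use strict;
use warnings;

# The pattern from this reply: the request capture is a greedy .+
# instead of [^"]+, so it runs to the LAST quote that is followed
# by a space and digits.
my $re = qr/^(\S+).*?\[(\S+).*?] (\S+) "(.+)" (\d+)/;

sub parse_line {
    my ($line) = @_;
    return unless $line =~ $re;
    return { host_ip => $1, timestamp => $2, vhost => $3,
             request => $4, HTTP_CODE => $5 };
}

my $goofy = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925';

my $hit = parse_line($goofy) or die "no match\n";
print "$_ = $hit->{$_}\n" for sort keys %$hit;
```

On the sample line the request comes back with its embedded quotes intact, and the status code is still captured as 404.
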
Re: Crafting a regex
by wardk (Deacon) on Oct 11, 2000 at 21:14 UTC

    You could just strip any/all quotes from $line just after the shift and modify the regex to not look for quotes. Since most of the processing is with the database, I wouldn't think adding another simple regex to strip the quotes completely would cause much trouble.
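
One way this might look is sketched below. It swaps in `tr///` for the suggested extra regex, since `tr` deletes characters without any pattern matching, and it anchors on the end of the line to find where the request stops. The end-anchored pattern is an assumption about the log layout (status code and byte count last), not something taken from the original post.

```perl
use strict;
use warnings;

sub parse_line {
    my ($line) = @_;

    # Delete every double quote from the line.
    $line =~ tr/"//d;

    # With the quotes gone, the request is whatever sits between the
    # vhost and the trailing status/bytes pair, so pin the end of the line.
    return unless $line =~ /^(\S+).*?\[(\S+).*?\] (\S+) (.+) (\d{3}) \S+\s*$/;
    return { host_ip => $1, timestamp => $2, vhost => $3,
             request => $4, HTTP_CODE => $5 };
}

my $goofy = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925';

my $hit = parse_line($goofy) or die "no match\n";
print "$hit->{HTTP_CODE} $hit->{request}\n";
```

Note the trade-off: the stored request no longer contains the quotes the user typed, so the database holds a slightly altered URL.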

Re: Crafting a regex
by mirod (Canon) on Oct 11, 2000 at 21:26 UTC

    I guess there is no generic way to reliably grab the location part, as it can include quotes, quotes followed by spaces, HTTP codes...
    So the answer is probably to match everything else, which has a fixed format, and then treat whatever is left over as the location:

    m/^\d+\d+\d+\d+ [\w-]+ [\w-]+ \[[^\]]+] [^\s]+ ".*" HTTP\/1.[01] \d{3} \d+$/

    This would make sure you match everything properly, and the .* part has no choice but to match the location string.

    Add capturing parentheses to taste to grab the parts you are interested in.

      Hey, the dots disappeared! It should be:

      m/^\d+\.\d+\.\d+\.\d+ [\w-]+ [\w-]+ \[[^\]]+] [^\s]+ ".*" HTTP\/1\.[01] \d{3} \d+$/

      of course!
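
The idea of pinning the fixed-format fields on both sides can be sketched with captures added. In the sample line from the question the protocol token (`HTTP/1.1`) sits inside the quoted request, so this sketch keeps it inside the capture rather than matching it after the closing quote. Accepting `-` as a byte count and the tolerance for a trailing newline are assumptions on my part.

```perl
use strict;
use warnings;

my $re = qr{
    ^(\d+\.\d+\.\d+\.\d+)      # host IP
    \ [\w-]+\ [\w-]+           # username and realm, not captured
    \ \[([^\]]+)\]             # timestamp, including the zone offset
    \ (\S+)                    # virtual host
    \ "(.*)"                   # request: everything up to the last quote
    \ (\d{3})\ (?:\d+|-)\s*$   # status code, then byte count or a dash
}x;

sub parse_line {
    my ($line) = @_;
    return unless $line =~ $re;
    return { host_ip => $1, timestamp => $2, vhost => $3,
             request => $4, HTTP_CODE => $5 };
}

my $goofy = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925';

my $hit = parse_line($goofy) or die "no match\n";
print "$_ = $hit->{$_}\n" for sort keys %$hit;
```

Because both ends are anchored, a line with extra or missing fields fails to match outright instead of yielding a shifted set of captures.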