arturo has asked for the wisdom of the Perl Monks concerning the following question:
Parsing an Apache logfile. Yes, I know that there is such a thing as Apache::ParseLog, but since it's OO I have efficiency worries, and anyhow the real point of this question is for me to hone some regex skills if possible. The problem: my current method looks like this:

    while (<LOGFILE>) {
        my $hit = parse_line($_);
        # do a bunch of stuff with the hashref, like
        # insert it into a DB.
    }

    sub parse_line {
        my $line = shift;
        if ($line =~ /^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/) {
            return {host_ip=>$1, timestamp=>$2, vhost=>$3, request=>$4, HTTP_CODE=>$5 };
        }
        # and some stuff to handle errors, that needn't bother us
    }

The problem is that (l)users are sometimes entering something silly into their location windows, such as a URL wrapped in double quotes (see the sample line at the end of this post), and those lines fail to match.
Now I *can* throw them away. These files are taking hours to process (the DB inserts take a while), and this sub processes over 2 million lines a run, so I'd like to make it as lean as possible. Is it worth my while to attempt to handle the silly URLs, or should I just forget about them?
How expensive in terms of processing time would it be to craft a smarter regex?
yes, I will deal with that dot-star in there ...
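
The cost question can be settled by measurement rather than by guessing. Below is a minimal sketch using the core Benchmark module. It times the pattern from parse_line against a hypothetical variant that spells out the skipped fields instead of using dot-star. The `$explicit` pattern, the `count_matches` helper, and the two sample lines are made up for illustration and are not from the real log.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The pattern from parse_line, and a hypothetical variant that names the
# skipped fields explicitly instead of using .*? (an illustration only).
my $lazy     = qr/^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/;
my $explicit = qr/^(\S+) \S+ \S+ \[(\S+) [^\]]*\] (\S+) "([^"]+)" (\d+)/;

# Made-up sample lines in the layout described in the post.
my @lines = (
    '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET / HTTP/1.1" 200 2925',
    '5.6.7.8 - bob [10/Oct/2000:00:19:14 -0400] www.foo.edu "GET /a.html HTTP/1.0" 404 310',
);

# Count how many sample lines a pattern matches.
sub count_matches {
    my ($re) = @_;
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ $re;
    }
    return $n;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy     => sub { count_matches($lazy) },
    explicit => sub { count_matches($explicit) },
});
```

Feeding it a few thousand real lines instead of two invented ones would give more trustworthy numbers. Since the DB inserts are said to take a while, it may also be worth timing the whole loop to see how much of the run the regex accounts for.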
Sample goofy line from log:

    1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925

i.e. IP address, username, realm, timestamp, virtual host name, "request string", HTTP code
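
One possible way to keep lines like the sample above is to stop defining the request as "everything up to the next quote" and instead anchor its closing quote to the three-digit status code that follows it. The sketch below assumes every line has the layout of the sample (host, two single-token fields, bracketed timestamp, vhost, quoted request, status) and that no later field looks like a quote followed by a status code. `$LINE_RE` is a name introduced here for illustration. The hash keys are the ones from parse_line.

```perl
use strict;
use warnings;

# Assumed layout, taken from the sample line:
#   host ident user [timestamp zone] vhost "request" status bytes
my $LINE_RE = qr{
    ^(\S+)                # host IP
    \s+\S+\s+\S+\s+       # two single-token fields, not captured
    \[([^\]\s]+)[^\]]*\]  # timestamp up to the first space; zone skipped
    \s+(\S+)              # virtual host
    \s+"(.*)"             # request: greedy, so embedded quotes are kept
    \s+(\d{3})\b          # three-digit status code anchors the closing quote
}x;

sub parse_line {
    my ($line) = @_;
    return unless $line =~ $LINE_RE;   # undef for lines that do not fit
    return {
        host_ip   => $1,
        timestamp => $2,
        vhost     => $3,
        request   => $4,
        HTTP_CODE => $5,
    };
}

my $goofy = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu '
          . '"GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925';
my $hit = parse_line($goofy);
print "$hit->{request} => $hit->{HTTP_CODE}\n" if $hit;
```

The greedy `(.*)` runs to the end of the line and backtracks to the last quote that is followed by a status code, so the backtracking cost grows with whatever trails the request. On lines this short that is likely small, but it is worth measuring on real data before trusting it across 2 million lines.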
Philosophy can be made out of anything -- or less

Replies are listed 'Best First'.

Re: Crafting a regex
by merlyn (Sage) on Oct 11, 2000 at 21:12 UTC
by arturo (Vicar) on Oct 11, 2000 at 21:31 UTC

Re: Crafting a regex
by ahunter (Monk) on Oct 11, 2000 at 21:29 UTC

Re: Crafting a regex
by wardk (Deacon) on Oct 11, 2000 at 21:14 UTC

Re: Crafting a regex
by mirod (Canon) on Oct 11, 2000 at 21:26 UTC
by mirod (Canon) on Oct 11, 2000 at 21:28 UTC