Parsing an Apache logfile. Yes, I know that there is such a thing as Apache::ParseLog but since it's OO, I have efficiency worries and anyhow, the real point of this question is for me to hone some regex skills if possible. The problem: my current method looks like this:
    while (<LOGFILE>) {
        my $hit = parse_line($_);
        # do a bunch of stuff with the hashref, like
        # insert it into a DB.
    }

    sub parse_line {
        my $line = shift;
        if ($line =~ /^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/) {
            return {host_ip=>$1, timestamp=>$2, vhost=>$3, request=>$4, HTTP_CODE=>$5 };
        }
        # and some stuff to handle errors, that needn't bother us
    }

The problem is that (l)users are sometimes entering something silly into their location windows (see the sample goofy line below), and those lines don't match the regex.
Now I *can* throw them away. These files are taking hours to process (the DB inserts take a while), so I'd like to know whether it's worth my while to attempt to handle the silly URLs, or whether I should just forget about them: this sub processes over 2 million lines a run and I'd like to make it as lean as possible.
How expensive in terms of processing time would it be to craft a smarter regex?
Yes, I will deal with that dot-star in there ...
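One way to put a number on the cost question is to time the candidates with the core Benchmark module. The sketch below compares the regex from parse_line with a variant that spells out the skipped fields instead of using dot-star. The sample line, the variable names and the explicit-field pattern are all assumptions made for illustration, and the pattern assumes the two fields after the IP address never contain spaces.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical well-formed line in the same layout as the real log.
my $line = '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu '
         . '"GET /index.html HTTP/1.1" 200 2925';

# The regex from parse_line, unchanged.
my $dot_star = qr/^(\S+).*?\[(\S+).*?] (\S+) "([^"]+)" (\d+)/;

# Assumed alternative: name every skipped field so nothing has to backtrack.
my $explicit = qr/^(\S+) \S+ \S+ \[(\S+) [^\]]*\] (\S+) "([^"]+)" (\d+)/;

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    dot_star => sub { my @fields = $line =~ $dot_star },
    explicit => sub { my @fields = $line =~ $explicit },
});
```

Whatever the table says, it only measures the match itself. If the DB inserts really account for most of the run time, even a noticeably slower regex may not change the total much.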
Sample goofy line from log:

    1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu "GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925

i.e. IP address, username, realm, timestamp, virtual host name, "request string", HTTP code
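For what it's worth, here is a rough sketch of a regex that would accept a line like that one. The idea is to let the request field match greedily and rely on the closing quote being the last one that is followed by a space and a three-digit status code. parse_line_tolerant is a made-up name, and the pattern assumes the two fields after the IP address contain no spaces.

```perl
use strict;
use warnings;

# Sketch only: parse_line_tolerant is a hypothetical name, not existing code.
sub parse_line_tolerant {
    my $line = shift;
    if ($line =~ m/^(\S+)\ \S+\ \S+\       # host, then two skipped fields
                   \[([^\]\s]+)[^\]]*\]\   # timestamp without the zone offset
                   (\S+)\                  # virtual host
                   "(.*)"\                 # request, embedded quotes allowed
                   (\d{3})\b               # HTTP status code
                  /x) {
        return { host_ip => $1, timestamp => $2, vhost => $3,
                 request => $4, HTTP_CODE => $5 };
    }
    return;    # no match: the caller decides whether to skip or log the line
}

my $hit = parse_line_tolerant(
    '1.2.3.4 - - [10/Oct/2000:00:19:13 -0400] www.foo.edu '
  . '"GET /"http://www.TheCounter.com/" HTTP/1.1" 404 2925'
);
print "$hit->{request}\n" if $hit;
```

The greedy match does backtrack from the end of the line, but only across the short tail after the request, so it should not be dramatically slower than the [^"]+ version on ordinary lines. Timing it on real data is the only way to be sure.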
Philosophy can be made out of anything -- or less
In reply to Crafting a regex by arturo