js1 has asked for the wisdom of the Perl Monks concerning the following question:
I'm puzzled and seek enlightenment about greedy regex. Here's a line from a log:
2004-05-13 14:02:00 blah blah
I found that a greedy regex used to parse the date time above such as:
\S*\s*\S*
ran faster than a specific regex like this one:
\d{4}\-\d{2}\-\d{2}\s\d{2}:\d{2}:\d{2}
Given, I have to write a perl script to parse 60GB of proxy logs, is it better/faster to use a greedy regex?
Also is there anything else I can do to speed up the parsing apart from using an optimised regex?
Thee has my thanks,
js1.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: fast greedy regex
by Zaxo (Archbishop) on Jun 07, 2004 at 21:14 UTC | |
You have fixed width fields in the date & time, so unpack is worth a look:
After Compline, | [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 07, 2004 at 21:36 UTC | |
I had the same thought, but unless I am doing something dumb (quite likely:), then strangely it seems that unpack is slower than even the explicit regex?
What stupidity am I guilty of? Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham"Think for yourself!" - Abigail | [reply] [d/l] |
by Abigail-II (Bishop) on Jun 07, 2004 at 22:42 UTC | |
Abigail | [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 08, 2004 at 00:02 UTC | |
by borisz (Canon) on Jun 08, 2004 at 00:31 UTC | |
| |
by Anonymous Monk on Jun 08, 2004 at 09:20 UTC | |
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 08, 2004 at 09:38 UTC | |
by grinder (Bishop) on Jun 08, 2004 at 11:20 UTC | |
unpack is worth a look Sometimes, sometimes not. Unpack still has to reparse the format string each time it is called (unless the results are cached internally, but last time I looked they weren't). This cost can add up, which is why substr sometimes outperforms it. One other remark to the OP: Matching a user-agent string with /(\".*\")/ is pretty horrendous, but there's not much you can do, given that there are some more-or-less malicious useragent strings that contain " themselves. When I ran into this problem years ago the only elegant way I found to deal with it reliably was to walk forwards up to the opening double quote isolating the various fields, walk backwards from the end of the string isolating the other fields (the Referrer if memory serves correctly) and what remained was the User agent. This can be done nicely with
The above fragment is non-tested and may contain a fencepost error, but you get the idea. Once this is out of the way it should be possible to construct non-backtracking regexps to match what's left in $front and $back. - another intruder with the mooring of the heat of the Perl | [reply] [d/l] |
|
Re: fast greedy regex
by BrowserUk (Patriarch) on Jun 08, 2004 at 08:43 UTC | |
sfink has already said this, and all credit should go to him as it hadn't even crossed my mind even though I am aware of the exponential time that can result from this type of all-optional regex. As I was still playing with my benchmarks, I thought I would have a look at some of the failure cases and the story they tell is worth (re-)emphasising. The following sets of results are from various variations upon your original regex. Some I've added anchors at either end, some I've switched \S* for \S+. Others I substituted \s for \s*. And various combinations, though not exhaustively. Please look at those numbers carefully. In the final failing case, just 2 records are being processed. One pass and one failure. Your original regex takes 8 million times longer to process those two records than almost any of the slightly more restrictive versions. The testcase is imperfect. It didn't iterate enough times for the bad cases and my machine was not otherwise idle for the (looong) duration of the test. But even if it is 2 or 3 orders of magnitude out, it still makes the point. This is the code use for all the benchmarks. Only the second __DATA__ line varies across all the results shown above Read more... (4 kB)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham"Think for yourself!" - Abigail | [reply] [d/l] [select] |
|
Re: fast greedy regex
by Roy Johnson (Monsignor) on Jun 07, 2004 at 21:40 UTC | |
If you just wanted the date and time, you could do ($date, $time) = split / /. The PerlMonk tr/// Advocate | [reply] [d/l] |
by js1 (Monk) on Jun 07, 2004 at 22:19 UTC | |
Just to give an idea of what I'm doing this is a line from the log:
2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service
And my regex is like this:
Can you see whether a more explicit regex would speed the parse up?
Thanks,
js1. | [reply] [d/l] |
by sfink (Deacon) on Jun 08, 2004 at 04:41 UTC | |
The problem is that because you are using * everywhere, there are an exponential number of ways for that match to fail. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang. I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $. Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:
It also appears that you'd be better off doing a little extra work so that you can use split instead of a regex: Alternatively, you could try (but remember to cut the parens off the relevant item.) | [reply] [d/l] [select] |
by js1 (Monk) on Jun 08, 2004 at 20:49 UTC | |
by Roy Johnson (Monsignor) on Jun 07, 2004 at 22:51 UTC | |
For speed, I would expect that one call to print, rather than multiple calls to printf, would be something of a speedup. The PerlMonk tr/// Advocate | [reply] |
by CountZero (Bishop) on Jun 08, 2004 at 06:18 UTC | |
So don't re-invent the (broken) wheel and use a module such as Regexp::Log or Logfile or their derived classes. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law | [reply] |
|
Re: fast greedy regex
by quai (Novice) on Jun 08, 2004 at 13:27 UTC | |
| [reply] |