fast greedy regex

js1 has asked for the wisdom of the Perl Monks concerning the following question:

Re: fast greedy regex by Zaxo (Archbishop) on Jun 07, 2004 at 21:14 UTC
You have fixed-width fields in the date and time, so unpack is worth a look:

    my $string = '2004-05-13 14:02:00 blah blah';
    my ($year, $month, $day, $hour, $min, $sec, $rest) =
        unpack 'a4 x a2 x a2 x a2 x a2 x a2 x a*', $string;

After Compline, Zaxo
Re^2: fast greedy regex by BrowserUk (Patriarch) on Jun 07, 2004 at 21:36 UTC
I had the same thought, but unless I am doing something dumb (quite likely :), strangely it seems that unpack is slower than even the explicit regex?

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];

    our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1+rand( 9 )) } 1 .. 1000;

    cmpthese( -1, {
        greedy => q[
            my( $date, $time, $text );
            m[(^\S*)\s(\S*)\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 )
            # and print "greedy: $date|$time|$text"
                for @data;
        ],
        explicit => q[
            my( $date, $time, $text );
            m[(^\d{4}\-\d{2}\-\d{2})\s(\d{2}:\d{2}:\d{2})\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 )
            # and print "explicit: $date|$time|$text"
                for @data;
        ],
        unpack => q[
            my( $date, $time, $text );
            ( $date, $time, $text ) = unpack 'A10 x A8 x A*', $_
            # and print "unpack: $date|$time|$text"
                for @data;
        ],
    });

    __END__
    P:\test>362106
               Rate   unpack explicit   greedy
    unpack    158/s       --     -41%     -53%
    explicit  267/s      70%       --     -21%
    greedy    338/s     114%      26%       --

What stupidity am I guilty of?

Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
Re: fast greedy regex by Abigail-II (Bishop) on Jun 07, 2004 at 22:42 UTC
Strange indeed. I get similar results, although with smaller differences. However, if I change the tests (but not the regexes or the data) slightly, I do get results where unpack wins:

    #! perl -slw
    use strict;
    use Benchmark qw [cmpthese];

    our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1 + rand (9)) } 1 .. 1000;
    our (@greedy, @explicit, @unpack);

    cmpthese (-1, {
        greedy   => '@greedy = map {/(^\S*)\s(\S*)\s(.*$)/} @data',
        explicit => '@explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s
                                       (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data',
        unpack   => '@unpack = map {unpack "A10 x A8 x A*" => $_} @data',
    });

    die unless "@greedy" eq "@explicit" && "@greedy" eq "@unpack";

    __END__
               Rate explicit   greedy   unpack
    explicit 86.1/s       --      -6%     -25%
    greedy   91.6/s       6%       --     -20%
    unpack    114/s      33%      25%       --

Abigail
Re^2: fast greedy regex by BrowserUk (Patriarch) on Jun 08, 2004 at 00:02 UTC
Re^3: fast greedy regex by borisz (Canon) on Jun 08, 2004 at 00:31 UTC
Re^3: fast greedy regex by Anonymous Monk on Jun 08, 2004 at 09:20 UTC
I was going to suggest `substr()` (which is around twice as fast as any of these to extract just these three pieces of data, perfectly formatted) - until I saw the real regex being used. I can't imagine `substr`, `unpack` or any rigidly formatted extraction method is any use at all for that lot.
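For the simple three-field sample used in the benchmarks above, the `substr()` idea might look like the sketch below. The offsets are an assumption: they only hold when every line starts with exactly `YYYY-MM-DD HH:MM:SS ` followed by the text, which, as the post says, is not true of the real log format.

```perl
use strict;
use warnings;

# Sample line in the same shape as the benchmark data in this thread.
my $line = '2004-05-13 14:02:00 blah blah';

# Fixed offsets: 10-char date, one space, 8-char time, one space, the rest.
my $date = substr $line, 0, 10;
my $time = substr $line, 11, 8;
my $text = substr $line, 20;

print "$date|$time|$text\n";
```

No format string has to be parsed at run time here, which may be part of why `substr` can come out ahead of `unpack` on rigidly formatted data.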
Re^4: fast greedy regex by BrowserUk (Patriarch) on Jun 08, 2004 at 09:38 UTC
Re^2: fast greedy regex by grinder (Bishop) on Jun 08, 2004 at 11:20 UTC
"unpack is worth a look"

Sometimes, sometimes not. Unpack still has to reparse the format string each time it is called (unless the results are cached internally, but last time I looked they weren't). This cost can add up, which is why substr sometimes outperforms it.

One other remark to the OP: Matching a user-agent string with `/(\".*\")/` is pretty horrendous, but there's not much you can do, given that there are some more-or-less malicious useragent strings that contain " themselves. When I ran into this problem years ago the only elegant way I found to deal with it reliably was to walk forwards up to the opening double quote isolating the various fields, walk backwards from the end of the string isolating the other fields (the Referrer if memory serves correctly) and what remained was the User agent. This can be done nicely with

    $front = substr( $_, 0, index($_, '"' ) - 1, '' );
    $back = substr( $_, rindex( $_, '"' ) + 1, '' );
    $user_agent = $_; # modulo a quote or space or two

The above fragment is non-tested and may contain a fencepost error, but you get the idea. Once this is out of the way it should be possible to construct non-backtracking regexps to match what's left in `$front` and `$back`.

- another intruder with the mooring of the heat of the Perl
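A worked version of the front/back idea could look like the following. This is only a sketch under stated assumptions: the sample line is made up for illustration, the line is assumed to contain exactly one double-quoted field (the user agent), and the variable names are not from the original post. It reads the three pieces out with `index`, `rindex` and plain `substr` rather than the four-argument destructive form, which sidesteps the fencepost question.

```perl
use strict;
use warnings;

# Hypothetical line: the user agent itself contains double quotes.
my $line = '61.2.249.106 - "Mozilla/4.0 (compatible; "odd" agent)" OBSERVED none';

my $first = index  $line, '"';   # position of the opening quote
my $last  = rindex $line, '"';   # position of the closing quote

die "no quoted field found\n" if $first < 0 || $last <= $first;

my $front      = substr $line, 0, $first;                      # before the quote
my $user_agent = substr $line, $first + 1, $last - $first - 1; # between the quotes
my $back       = substr $line, $last + 1;                      # after the quote

# Trim the separating spaces left on either side of the quoted field.
s/^\s+|\s+$//g for $front, $back;

print "front=[$front]\nagent=[$user_agent]\nback=[$back]\n";
```

Because only the first and last quote positions matter, embedded quotes inside the user agent do not disturb the split.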
Re: fast greedy regex by BrowserUk (Patriarch) on Jun 08, 2004 at 08:43 UTC
sfink has already said this, and all credit should go to him, as it hadn't even crossed my mind even though I am aware of the exponential time that can result from this type of all-optional regex. As I was still playing with my benchmarks, I thought I would have a look at some of the failure cases, and the story they tell is worth (re-)emphasising.

The following sets of results are from various variations upon your original regex. Some I've added anchors at either end, some I've switched \S* for \S+. Others I substituted \s for \s*. And various combinations, though not exhaustively.

This uses 10 copies of your sample dataline + 1 (matching) like this: '- - - - - - - - - - - - - - - - "- -" - - - - -'.

    P:\test>362135
               Rate  ^\S\s$  \S+\s*    \S\s ^\S+\s$  ^\S\s$ ^\S+\s$   \S*\s   \S+\s
    ^\S\s$   1346/s      --     -1%     -1%     -3%     -4%     -4%     -5%     -6%
    \S+\s*   1365/s      1%      --     -0%     -2%     -3%     -3%     -3%     -4%
    \S\s     1365/s      1%     -0%      --     -2%     -3%     -3%     -3%     -4%
    ^\S+\s$  1392/s      3%      2%      2%      --     -1%     -1%     -1%     -2%
    ^\S\s$   1407/s      5%      3%      3%      1%      --     -0%     -0%     -1%
    ^\S+\s$  1408/s      5%      3%      3%      1%      0%      --     -0%     -1%
    \S*\s    1413/s      5%      4%      4%      2%      0%      0%      --     -1%
    \S+\s    1428/s      6%      5%      5%      3%      1%      1%      1%      --

Not much difference between them, a few percent here and there. But watch what happens if we omit the second " from the second data line.

1 sample + 1 failing (missing "): '- - - - - - - - - - - - - - - - "- - - - - - -'

    P:\test>362135 -N=1
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
                    Rate     ^\S\s$      \S\s*  \S+\s* ^\S+\s$  ^\S\s$    \S\s ^\S+\s$   \S+\s
    ^\S\s$   3.07e-002/s         --        -1%   -100%   -100%   -100%   -100%   -100%   -100%
    \S\s*    3.10e-002/s         1%         --   -100%   -100%   -100%   -100%   -100%   -100%
    \S+\s*        6685/s  21746132%  21548263%      --     -1%    -22%    -23%    -51%    -51%
    ^\S+\s$       6747/s  21947909%  21748203%      1%      --    -21%    -22%    -51%    -51%
    ^\S\s$        8533/s  27758186%  27505613%     28%     26%      --     -1%    -38%    -38%
    \S\s          8645/s  28123106%  27867213%     29%     28%      1%      --    -37%    -37%
    ^\S+\s$      13740/s  44696322%  44289628%    106%    104%     61%     59%      --     -0%
    \S+\s        13752/s  44736625%  44329565%    106%    104%     61%     59%      0%      --

That is a pretty persuasive argument for not using `.*`. (It is 400,000 times slower!) Especially when there are other variations (eg. ^\S+\s$ ) that actually work out slightly quicker when the match succeeds, and that fail quickly when they don't.

Hopefully that is convincing enough, but you may be wondering why I used '- - - - - - - - - - - - - - - "- -" - - - - -' for the failing test rather than just omitting the quote from your sample. The next results show why.

1 * sample + 1 matching: '-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "-- --" -- -- -- -- --'

All I did was double the number of dashes everywhere and, as you can see, the match results show little change.

    P:\test>362135 -N=1
               Rate  ^\S\s$  \S+\s*   \S+\s ^\S+\s$ ^\S+\s$   \S\s*  ^\S\s$    \S\s
    ^\S\s$   7440/s      --     -1%     -1%     -2%     -3%     -4%     -5%     -6%
    \S+\s*   7505/s      1%      --     -0%     -1%     -2%     -3%     -5%     -5%
    \S+\s    7538/s      1%      0%      --     -0%     -1%     -3%     -4%     -4%
    ^\S+\s$  7558/s      2%      1%      0%      --     -1%     -3%     -4%     -4%
    ^\S+\s$  7645/s      3%      2%      1%      1%      --     -1%     -3%     -3%
    \S\s*    7756/s      4%      3%      3%      3%      1%      --     -1%     -2%
    ^\S\s$   7862/s      6%      5%      4%      4%      3%      1%      --     -0%
    \S\s     7891/s      6%      5%      5%      4%      3%      2%      0%      --

But now see what happens when I remove the second quote again.

1 * sample + 1 failing (missing "): '-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "-- -- -- -- -- -- --'

    P:\test>362135 -N=1
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
                    Rate       \S\s     ^\S\s$  \S+\s* ^\S+\s$  ^\S\s$   \S*\s   \S+\s ^\S+\s$
    \S\s     1.23e-003/s         --        -0%   -100%   -100%   -100%   -100%   -100%   -100%
    ^\S\s$   1.23e-003/s         0%         --   -100%   -100%   -100%   -100%   -100%   -100%
    \S+\s*        34.3/s   2795013%   2782516%      --     -8%    -99%    -99%   -100%   -100%
    ^\S+\s$       37.4/s   3045735%   3032117%      9%      --    -99%    -99%   -100%   -100%
    ^\S\s$        4577/s 372761400% 371094785%  13236%  12138%      --    -12%    -53%    -58%
    \S*\s         5228/s 425715150% 423811779%  15131%  13877%     14%      --    -46%    -52%
    \S+\s         9683/s 788529651% 785004138%  28111%  25789%    112%     85%      --    -11%
    ^\S+\s$      10828/s 881763143% 877820783%  31447%  28850%    137%    107%     12%      --

Please look at those numbers carefully. In the final failing case, just 2 records are being processed. One pass and one failure. Your original regex takes 8 million times longer to process those two records than almost any of the slightly more restrictive versions.

The testcase is imperfect. It didn't iterate enough times for the bad cases, and my machine was not otherwise idle for the (looong) duration of the test. But even if it is 2 or 3 orders of magnitude out, it still makes the point.

The code used for all the benchmarks is not reproduced here. Only the second __DATA__ line varies across all the results shown above.

Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
Re: fast greedy regex by Roy Johnson (Monsignor) on Jun 07, 2004 at 21:40 UTC
It depends on how finely parsed you need it to be. Your first example has basically three subsections. Your second example is much more broken-out. Neither of them captures anything. If you just wanted the date and time, you could do `($date, $time) = split / /`. The PerlMonk `tr///` Advocate
Re^2: fast greedy regex by js1 (Monk) on Jun 07, 2004 at 22:19 UTC
Just to give an idea of what I'm doing, this is a line from the log:

    2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service

And my regex is like this:

    while (<>){
        /(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)/;
        #print "\n";
        #printf "\ndate time = %s",$1;
        #printf "\ntime taken = %s",$2;
        #printf "\nc-ip = %s",$3;
        #printf "\nsc-status = %s",$4;
        #printf "\ns-action = %s",$5;
        #printf "\nsc-bytes = %s",$6;
        #printf "\ncs-bytes = %s",$7;
        #printf "\ncs-method = %s",$8;
        #printf "\ncs-uri-scheme = %s",$9;
        #printf "\ncs-host = %s",$10;
        #printf "\ncs-uri-stem = %s",$11;
        #printf "\ncs-username = %s",$12;
        #printf "\ns-hierarchy = %s",$13;
        #printf "\ns-supplier-name = %s",$14;
        #printf "\ncs(Content-Type)= %s",$15;
        #printf "\ncs(User-Agent) = %s",$16;
        #printf "\nsc-filter-result = %s",$17;
        #printf "\nsc-filter-category = %s",$18;
        #printf "\nx-virus-id = %s",$19;
        #printf "\ns-ip = %s",$20;
        #printf "\ns-sitename = %s",$21;
    }

Can you see whether a more explicit regex would speed the parse up?

Thanks, js1.
Re^3: fast greedy regex by sfink (Deacon) on Jun 08, 2004 at 04:41 UTC
Ouch. Perhaps all of your log lines are perfectly formatted, but I would still recommend not doing that just in case you somehow have something fail. At least anchor the expression. The problem is that because you are using * everywhere, there are an exponential number of ways for that match to fail. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang.

I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $.

Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:

    print <<"END_FORMAT";
    date time = $1
    time taken = $2
    c-ip = $3
    .
    .
    .
    END_FORMAT

It also appears that you'd be better off doing a little extra work so that you can use `split` instead of a regex:

    my ($user_agent) = /\"(.*)\"/;
    s/\".*\"//;
    my @F = split(/\s+/);
    print <<"END_FORMAT";
    date time = $F[0] $F[1]
    time taken = $F[2]
    .
    .
    .
    cs(Content-Type) = $F[15]
    cs(User-Agent) = $user_agent
    sc-filter-result = $F[16]
    .
    .
    .
    END_FORMAT

Alternatively, you could try

    my @F = /(\" .*? \" | \S+)/gx;

(but remember to cut the parens off the relevant item.)
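Applied to the sample log line posted earlier in the thread, the extract-then-`split` approach might be sketched as below. Working on a copy (`$rest`) instead of `$_` and printing only four of the fields are choices made here for illustration. The field indices assume the layout of that one sample line.

```perl
use strict;
use warnings;

# The sample log line from this thread, assembled in pieces for readability.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http '
         . 'www.wahm.com http://www.wahm.com/images/vote.gif u779479 '
         . 'DEFAULT_PARENT 61.2.249.106 - '
         . '"Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" '
         . 'OBSERVED none - 61.2.249.47 SG-HTTP-Service';

# Capture the quoted user agent first, because it contains spaces.
my ($user_agent) = $line =~ /"(.*)"/;

# Remove it (quotes included) from a copy; what is left is whitespace-delimited.
(my $rest = $line) =~ s/".*"//;
my @F = split /\s+/, $rest;

# One print call with a here-document instead of many printf calls.
print <<"END_FORMAT";
date time        = $F[0] $F[1]
cs(Content-Type) = $F[15]
cs(User-Agent)   = $user_agent
sc-filter-result = $F[16]
END_FORMAT
```

Neither regex here has more than one quantified piece, so a malformed line cannot trigger the heavy backtracking discussed above. A line with no quoted field would simply leave `$user_agent` undefined, which a real script would want to check for.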
Re^4: fast greedy regex by js1 (Monk) on Jun 08, 2004 at 20:49 UTC
Re^3: fast greedy regex by Roy Johnson (Monsignor) on Jun 07, 2004 at 22:51 UTC
I don't see any reason that a more explicit regex would speed it up. The only obvious speed benefit from a more explicit match is that failure could happen sooner. For speed, I would expect that one call to print, rather than multiple calls to printf, would be something of a speedup. The PerlMonk `tr///` Advocate
Re^3: fast greedy regex by CountZero (Bishop) on Jun 08, 2004 at 06:18 UTC
One unexpected space somewhere in this string and you are toast. So don't re-invent the (broken) wheel and use a module such as Regexp::Log or Logfile or their derived classes. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: fast greedy regex by quai (Novice) on Jun 08, 2004 at 13:27 UTC
Using + (one or more) instead of * (zero or more), and ^ and $ to mark the start and end, will give you almost 100% speed up (here at least).
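As a rough illustration of that advice, the sketch below builds an anchored pattern with `+` quantifiers and runs it against a well-formed and a malformed line. The field counts (16 tokens, one quoted user agent, 5 more tokens) are an assumption taken from the sample log line earlier in the thread, and the names `$token`, `$body` and `$re` are invented here.

```perl
use strict;
use warnings;

# Assumed layout: 16 space-separated tokens, a quoted user agent, 5 more tokens.
my $token = '(\S+)';
my $body  = join '\s', ($token) x 16, '"(.*)"', ($token) x 5;
my $re    = qr/^$body$/;   # anchored at both ends

my $good = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http '
         . 'www.wahm.com http://www.wahm.com/images/vote.gif u779479 '
         . 'DEFAULT_PARENT 61.2.249.106 - '
         . '"Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" '
         . 'OBSERVED none - 61.2.249.47 SG-HTTP-Service';

# Malformed copy: drop the closing quote of the user agent.
(my $bad = $good) =~ s/"(?= OBSERVED)//;

for my $line ($good, $bad) {
    if (my @f = $line =~ $re) {
        print "matched, user agent: $f[16]\n";
    }
    else {
        print "rejected malformed line\n";
    }
}
```

Because every `\S+` must consume at least one character and every separator is a mandatory `\s`, there is essentially one way to divide a line into fields, so the malformed line is rejected without the engine exploring a large number of alternatives.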