in reply to fast greedy regex
sfink has already said this, and all credit should go to him as it hadn't even crossed my mind even though I am aware of the exponential time that can result from this type of all-optional regex.
As I was still playing with my benchmarks, I thought I would have a look at some of the failure cases and the story they tell is worth (re-)emphasising.
The following sets of results are from various variations upon your original regex. Some I've added anchors at either end, some I've switched \S* for \S+. Others I substituted \s for \s*. And various combinations, though not exhaustively.
P:\test>362135 Rate ^\S*\s*$ \S+\s* \S*\s* ^\S+\s*$ ^\S*\s$ ^\S+\s$ \S* +\s \S+\s ^\S*\s*$ 1346/s -- -1% -1% -3% -4% -4% - +5% -6% \S+\s* 1365/s 1% -- -0% -2% -3% -3% - +3% -4% \S*\s* 1365/s 1% -0% -- -2% -3% -3% - +3% -4% ^\S+\s*$ 1392/s 3% 2% 2% -- -1% -1% - +1% -2% ^\S*\s$ 1407/s 5% 3% 3% 1% -- -0% - +0% -1% ^\S+\s$ 1408/s 5% 3% 3% 1% 0% -- - +0% -1% \S*\s 1413/s 5% 4% 4% 2% 0% 0% +-- -1% \S+\s 1428/s 6% 5% 5% 3% 1% 1% +1% --
Not much difference between them a few percent here and there. But, watch what happens if we omit the second " from the second data line.
P:\test>362135 -N=1 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate ^\S*\s*$ \S*\s* \S+\s* ^\S+\s*$ ^\S*\s$ \S*\s + ^\S+\s$ \S+\s ^\S*\s*$ 3.07e-002/s -- -1% -100% -100% -100% -100% + -100% -100% \S*\s* 3.10e-002/s 1% -- -100% -100% -100% -100% + -100% -100% \S+\s* 6685/s 21746132% 21548263% -- -1% -22% -23% + -51% -51% ^\S+\s*$ 6747/s 21947909% 21748203% 1% -- -21% -22% + -51% -51% ^\S*\s$ 8533/s 27758186% 27505613% 28% 26% -- -1% + -38% -38% \S*\s 8645/s 28123106% 27867213% 29% 28% 1% -- + -37% -37% ^\S+\s$ 13740/s 44696322% 44289628% 106% 104% 61% 59% + -- -0% \S+\s 13752/s 44736625% 44329565% 106% 104% 61% 59% + 0% --
That is a pretty persuasive argument for not using .*. (It 400,000 times slower!)
Especially when there are other variations (eg. ^\S+\s$ ) that actually work out slightly quicker when the match succeeds, and that fail quickly when they don't.
Hopefully that is convincing enough, but you may be wondering why I used '- - - - - - - - - - - - - - - "- -" - - - - -' for the failing test rather than just omitting the quote from your sample. The next results show why.
All I did was double the number of dashes everywhere and as you can see, the the match results show little change.
P:\test>362135 -N=1 Rate ^\S*\s*$ \S+\s* \S+\s ^\S+\s*$ ^\S+\s$ \S*\s* ^\S*\ +s$ \S*\s ^\S*\s*$ 7440/s -- -1% -1% -2% -3% -4% - +5% -6% \S+\s* 7505/s 1% -- -0% -1% -2% -3% - +5% -5% \S+\s 7538/s 1% 0% -- -0% -1% -3% - +4% -4% ^\S+\s*$ 7558/s 2% 1% 0% -- -1% -3% - +4% -4% ^\S+\s$ 7645/s 3% 2% 1% 1% -- -1% - +3% -3% \S*\s* 7756/s 4% 3% 3% 3% 1% -- - +1% -2% ^\S*\s$ 7862/s 6% 5% 4% 4% 3% 1% +-- -0% \S*\s 7891/s 6% 5% 5% 4% 3% 2% +0% --
But now see what happens when I remove the second quote again
P:\test>362135 -N=1 (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) Rate \S*\s* ^\S*\s*$ \S+\s* ^\S+\s*$ ^\S*\s$ \S* +\s \S+\s ^\S+\s$ \S*\s* 1.23e-003/s -- -0% -100% -100% -100% -10 +0% -100% -100% ^\S*\s*$ 1.23e-003/s 0% -- -100% -100% -100% -10 +0% -100% -100% \S+\s* 34.3/s 2795013% 2782516% -- -8% -99% -9 +9% -100% -100% ^\S+\s*$ 37.4/s 3045735% 3032117% 9% -- -99% -9 +9% -100% -100% ^\S*\s$ 4577/s 372761400% 371094785% 13236% 12138% -- -1 +2% -53% -58% \S*\s 5228/s 425715150% 423811779% 15131% 13877% 14% +-- -46% -52% \S+\s 9683/s 788529651% 785004138% 28111% 25789% 112% 8 +5% -- -11% ^\S+\s$ 10828/s 881763143% 877820783% 31447% 28850% 137% 10 +7% 12% --
Please look at those numbers carefully. In the final failing case, just 2 records are being processed. One pass and one failure.
Your original regex takes 8 million times longer to process those two records than almost any of the slightly more restrictive versions. The testcase is imperfect. It didn't iterate enough times for the bad cases and my machine was not otherwise idle for the (looong) duration of the test. But even if it is 2 or 3 orders of magnitude out, it still makes the point.
This is the code use for all the benchmarks. Only the second __DATA__ line varies across all the results shown above
#! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $N ||= 10; our @data = (scalar <DATA>) x $N; push @data, <DATA>; cmpthese( -1, { '\S*\s*' => q[ my @bits; @bits = m[(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\ +s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S +*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)] for @data; ], '\S+\s*' => q[ my @bits; @bits = m[(\S+\s*\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\ +s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S ++)\s*(\".*\")\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)] for @data; ], '\S*\s' => q[ my @bits; @bits = m[(\S*\s\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*) +\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\".*\")\s( +\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)] for @data; ], '\S+\s' => q[ my @bits; @bits = m[(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+) +\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\".*\")\s( +\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)] for @data; ], '^\S*\s*$' => q[ my @bits; @bits = m[(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\ +s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S +*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)] for @data; ], '^\S+\s*$' => q[ my @bits; @bits = m[(\S+\s*\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\ +s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S ++)\s*(\".*\")\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)] for @data; ], '^\S*\s$' => q[ my @bits; @bits = m[(\S*\s\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*) +\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\".*\")\s( +\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)] for @data; ], '^\S+\s$' => q[ my @bits; @bits = m[(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+) +\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\".*\")\s( +\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)] for @data; ], }); __DATA__ 2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wa +hm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61. +2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVE +D none - 61.2.249.47 SG-HTTP-Service -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "-- -- -- -- -- -- --
|
|---|