Background: Using dot plus a quantifier in a regex is a tricky business as it's not terribly precise. Worse, when iterated over, it's a performance killer. For example: /".+"/ will cause the .+ to grab as much stuff as it can, but then is forced to backtrack to that final quote, killing efficiency. Using something like /"[^"]+"/ keeps pulling in characters until we get to a quote. No backtracking is involved and it's usually more efficient as a result.
When matching against a string with more than one quote, the people often make it lazy by adding the question mark after the quantifier (/".+?"/). Since I cannot find a reference to the actual mechanics of making a quantifier lazy, I ran some benchmarks to figure out what was going on. I felt that lazy was probably less efficient than greedy due to the extra work the regex compiler needed to do to verify that it was making a minimal match. Most of my benchmarks bore that out, until now.
Now if we switch the first regex line to the following:#!/usr/bin/perl -w use strict; use vars qw(@myvar $iteration); use Benchmark; @myvar = ('asdf"' . 'z' x 2000 . '"asdf', 'z' x 2000 . '"asdf"', '"asdf"' . 'z' x 2000, 'z' x 2000 . '"' . 'z' x 2000 . '"' . 'z' x 2000 ); for $iteration (0 .. (@myvar-1)) { print "Iteration: $iteration\n"; timethese(500000, { Greedy => '$myvar[$iteration] =~ /".+"/', Negated => '$myvar[$iteration] =~ /"[^"]+"/' }); }; Iteration: 0 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 13 wallclock secs (13.84 usr + 0.00 sys = 13.84 CPU) Negated: 50 wallclock secs (49.99 usr + 0.00 sys = 49.99 CPU) Iteration: 1 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 18 wallclock secs (17.68 usr + 0.00 sys = 17.68 CPU) Negated: 17 wallclock secs (17.02 usr + 0.00 sys = 17.02 CPU) Iteration: 2 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 46 wallclock secs (46.75 usr + 0.00 sys = 46.75 CPU) Negated: 2 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) Iteration: 3 Benchmark: timing 500000 iterations of Greedy, Negated... Greedy: 73 wallclock secs (73.60 usr + 0.00 sys = 73.60 CPU) Negated: 65 wallclock secs (64.97 usr + 0.00 sys = 64.97 CPU)
We find that we get results similar to these:Lazy => '$myvar[$iteration] =~ /".+?"/',
There's nothing too surprising there. Iteration 1 is about even. Iterations 0 and 3 clearly show that lazy is less efficient than greedy. But what the heck is going on with iteration 2? Lazy just beat the pants off of greedy. It went from being even or less efficient to all of a sudden running about 25 times faster. Clearly, there is something internal to the regex engine that's doing something here, but I can't figure it out. Help!Iteration: 0 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 184 wallclock secs (183.51 usr + 0.00 sys = 183.51 CPU) Negated: 49 wallclock secs (49.59 usr + 0.00 sys = 49.59 CPU) Iteration: 1 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 18 wallclock secs (17.96 usr + 0.00 sys = 17.96 CPU) Negated: 17 wallclock secs (17.03 usr + 0.00 sys = 17.03 CPU) Iteration: 2 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 2 wallclock secs ( 2.26 usr + 0.00 sys = 2.26 CPU) Negated: 1 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) Iteration: 3 Benchmark: timing 500000 iterations of Lazy, Negated... Lazy: 199 wallclock secs (198.66 usr + 0.00 sys = 198.66 CPU) Negated: 64 wallclock secs (64.59 usr + 0.00 sys = 64.59 CPU)
In reply to How are we lazy? by Ovid
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |