comment on

I've been concentrating many of my responses lately on regex-related questions as I wish to be at one with Perl's regex engine. To that end, I've been playing around with some benchmarks and have come across a result that I do not understand.

Background: Using dot plus a quantifier in a regex is a tricky business as it's not terribly precise. Worse, when iterated over, it's a performance killer. For example: /".+"/ will cause the .+ to grab as much stuff as it can, but then is forced to backtrack to that final quote, killing efficiency. Using something like /"[^"]+"/ keeps pulling in characters until we get to a quote. No backtracking is involved and it's usually more efficient as a result.

When matching against a string with more than one quote, the people often make it lazy by adding the question mark after the quantifier (/".+?"/). Since I cannot find a reference to the actual mechanics of making a quantifier lazy, I ran some benchmarks to figure out what was going on. I felt that lazy was probably less efficient than greedy due to the extra work the regex compiler needed to do to verify that it was making a minimal match. Most of my benchmarks bore that out, until now.

#!/usr/bin/perl -w
use strict;

use vars qw(@myvar $iteration);
use Benchmark;
@myvar = ('asdf"' . 'z' x 2000 . '"asdf',
          'z' x 2000 . '"asdf"',
          '"asdf"' . 'z' x 2000,
          'z' x 2000 . '"' . 'z' x 2000 . '"' . 'z' x 2000
          );

for $iteration (0 .. (@myvar-1)) {
    print "Iteration:  $iteration\n";
    timethese(500000, {
        Greedy  => '$myvar[$iteration] =~ /".+"/',
        Negated => '$myvar[$iteration] =~ /"[^"]+"/'
    });
};

Iteration:  0
Benchmark: timing 500000 iterations of Greedy, Negated...
    Greedy: 13 wallclock secs (13.84 usr +  0.00 sys = 13.84 CPU)
   Negated: 50 wallclock secs (49.99 usr +  0.00 sys = 49.99 CPU)
Iteration:  1
Benchmark: timing 500000 iterations of Greedy, Negated...
    Greedy: 18 wallclock secs (17.68 usr +  0.00 sys = 17.68 CPU)
   Negated: 17 wallclock secs (17.02 usr +  0.00 sys = 17.02 CPU)
Iteration:  2
Benchmark: timing 500000 iterations of Greedy, Negated...
    Greedy: 46 wallclock secs (46.75 usr +  0.00 sys = 46.75 CPU)
   Negated:  2 wallclock secs ( 1.92 usr +  0.00 sys =  1.92 CPU)
Iteration:  3
Benchmark: timing 500000 iterations of Greedy, Negated...
    Greedy: 73 wallclock secs (73.60 usr +  0.00 sys = 73.60 CPU)
   Negated: 65 wallclock secs (64.97 usr +  0.00 sys = 64.97 CPU)
[download]

Now if we switch the first regex line to the following:

        Lazy    => '$myvar[$iteration] =~ /".+?"/',
[download]

We find that we get results similar to these:

Iteration:  0
Benchmark: timing 500000 iterations of Lazy, Negated...
      Lazy: 184 wallclock secs (183.51 usr +  0.00 sys = 183.51 CPU)
   Negated: 49 wallclock secs (49.59 usr +  0.00 sys = 49.59 CPU)
Iteration:  1
Benchmark: timing 500000 iterations of Lazy, Negated...
      Lazy: 18 wallclock secs (17.96 usr +  0.00 sys = 17.96 CPU)
   Negated: 17 wallclock secs (17.03 usr +  0.00 sys = 17.03 CPU)
Iteration:  2
Benchmark: timing 500000 iterations of Lazy, Negated...
      Lazy:  2 wallclock secs ( 2.26 usr +  0.00 sys =  2.26 CPU)
   Negated:  1 wallclock secs ( 1.92 usr +  0.00 sys =  1.92 CPU)
Iteration:  3
Benchmark: timing 500000 iterations of Lazy, Negated...
      Lazy: 199 wallclock secs (198.66 usr +  0.00 sys = 198.66 CPU)
   Negated: 64 wallclock secs (64.59 usr +  0.00 sys = 64.59 CPU)
[download]

There's nothing too surprising there. Iteration 1 is about even. Iterations 0 and 3 clearly show that lazy is less efficient than greedy. But what the heck is going on with iteration 2? Lazy just beat the pants off of greedy. It went from being even or less efficient to all of a sudden running about 25 times faster. Clearly, there is something internal to the regex engine that's doing something here, but I can't figure it out. Help!

In reply to How are we lazy? by Ovid

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.