in reply to Removing backtracking from a .*? regexp

For curiosity on the performance issue, I have conducted some experiments to compare the performance between a straight regular expression with capture and positive look-ahead with substr & length functions.

The following is my code for the experiment -
use strict; use Benchmark qw/timethese cmpthese/; # use re 'debug'; chomp(my @lines = <DATA>); my $target = shift || 'word'; my $re_positive_lookahead = qr/^a=<(?=.*?word)/; my $re_loose = qr/a=<(.*?$target.*?)>/; cmpthese( timethese(1000000, { 'Match_L' => '&Match_Loose', 'Match_P' => '&Match_PLAhead_plus_Substr', }) ); sub Match_Loose { foreach (@lines) { if (/$re_loose/) { my $word = $1; } } } sub Match_PLAhead_plus_Substr { foreach (@lines) { if (/$re_positive_lookahead/) { my $word = substr($_, 3, length($_)-4); } } } __DATA__ a=<swords> a=<wordy> b=<rappinghood> a=<thisword> b=<thatword> a=<foreword> b=<junk> a=<nothing> b=<word> b=<wordplay> b=<end>
And the result of the experiment is as follows -
Benchmark: timing 1000000 iterations of Match_L, Match_P... Match_L: 31 wclock secs (31.16 usr + 0.00 sys = 31.16 CPU) @ 32092.4 +3/s Match_P: 27 wclock secs (27.64 usr + 0.00 sys = 27.64 CPU) @ 36179.4 +5/s Rate Match_L Match_P Match_L 32092/s -- -11% Match_P 36179/s 13% --
The observation is that positive lookahead combined with substr is about 12% faster than straight regexp with capture, which is not an insignificant speed improvement.

Replies are listed 'Best First'.
Re: Re: Removing backtracking from a .*? regexp
by Anonymous Monk on Nov 18, 2003 at 01:15 UTC
    The observation is that positive lookahead combined with substr is about 12% faster than straight regexp with capture, which is not an insignificant speed improvement.

    You are testing a highly sanitized version of the problem. The OP reduced it a barebones example. Surely the OP's logs do not begin with, nor even contain, the data shown, making it more complex to use substr to grab the right chunk of string. Secondly, your speed gain disappears entirely when you remove trailing backtracking from the loose regex as mentioned earlier. Try it with the this regex:

    my $re_loose = qr/a=<(.*?$target[^>]+)>/;

    Last, but most certainly not least: It is far too easy to reach unjustfied conclusions when benchmarking.