in reply to regexp puzzle

There are people around here more qualified to discuss some of the guts, but I can give you a rough idea of what the article is suggesting. What they are saying is that Perl's algorithm handles repeated backtracking poorly.

#!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(gettimeofday); for my $n (1 .. 25) { my @start = gettimeofday; $_ = 'a' x $n; /(?:a?){$n}a{$n}/; my @end = gettimeofday; my $elapsed = $end[0] - $start[0] + 10**-6 * ($end[1] - $start[1]) +; print "$elapsed\n"; }

outputs

4.5e-005 3.2e-005 3.1e-005 3.4e-005 3.6e-005 4.2e-005 5.4e-005 8.1e-005 0.00013 0.000232 0.000448 0.000899 0.001696 0.003383 0.006309 0.012884 0.026508 0.053743 0.108439 0.212322 0.431355 0.890586 1.703086 3.468719 7.17187

Frankly, while I'm sure there is room for improvement in the regex library, I would characterize their example as poorly written code - rather than using repeated one-off matches, you should be using variable-width matches:

#!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(gettimeofday); for my $n (1 .. 25) { my @start = gettimeofday; $_ = 'a' x $n; /a{0,$n}a{$n}/; my @end = gettimeofday; my $elapsed = $end[0] - $start[0] + 10**-6 * ($end[1] - $start[1]) +; print "$elapsed\n"; }

outputs

4.2e-005 2.9e-005 2.8e-005 2.9e-005 2.8e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.8e-005 2.8e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 3e-005 2.9e-005 2.9e-005 3e-005 3e-005 2.9e-005

and does the same thing. Note I've used curly brackets ({}) to indicate repetition; see perlretut for more info.

Replies are listed 'Best First'.
Re^2: regexp puzzle
by bcrowell2 (Friar) on Jan 06, 2011 at 01:35 UTC
    Ah, I see -- thanks! The author of the article works for google, and he wrote software for them that would accept regexp from random users on the web, so he wanted something that would always take linear time. (OTOH, seems like he could simply use any random regexp engine and put a time limit on how long it was allowed to tun.)