bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:

O Monks,

This article http://swtch.com/~rsc/regexp/regexp1.html critizes Perl's regexp engine and gives an example where it performs very badly. I'm trying to convert the example, given in pseudocode, into real perl code, and not succeeding. Can anyone help?

Here are the things I can't figure out:

(1) They use a repetition count n twice. This could mean either that (a) the two counts have to be the same, or that (b) they could be different. Which one would make sense here?

(2) In case (a), I don't know how to code the constraint on the counts into a perl regexp. In case (b), I get this perl -e 'if (("a" x 29)=~/(a?)*a*/) {print 1}', which doesn't have the poor performance they claim.

Replies are listed 'Best First'.
Re: regexp puzzle
by kennethk (Abbot) on Jan 06, 2011 at 00:41 UTC
    There are people around here more qualified to discuss some of the guts, but I can give you a rough idea of what the article is suggesting. What they are saying is that Perl's algorithm handles repeated backtracking poorly.

    #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(gettimeofday); for my $n (1 .. 25) { my @start = gettimeofday; $_ = 'a' x $n; /(?:a?){$n}a{$n}/; my @end = gettimeofday; my $elapsed = $end[0] - $start[0] + 10**-6 * ($end[1] - $start[1]) +; print "$elapsed\n"; }

    outputs

    4.5e-005 3.2e-005 3.1e-005 3.4e-005 3.6e-005 4.2e-005 5.4e-005 8.1e-005 0.00013 0.000232 0.000448 0.000899 0.001696 0.003383 0.006309 0.012884 0.026508 0.053743 0.108439 0.212322 0.431355 0.890586 1.703086 3.468719 7.17187

    Frankly, while I'm sure there is room for improvement in the regex library, I would characterize their example as poorly written code - rather than using repeated one-off matches, you should be using variable-width matches:

    #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw(gettimeofday); for my $n (1 .. 25) { my @start = gettimeofday; $_ = 'a' x $n; /a{0,$n}a{$n}/; my @end = gettimeofday; my $elapsed = $end[0] - $start[0] + 10**-6 * ($end[1] - $start[1]) +; print "$elapsed\n"; }

    outputs

    4.2e-005 2.9e-005 2.8e-005 2.9e-005 2.8e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.8e-005 2.8e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 2.9e-005 3e-005 2.9e-005 2.9e-005 3e-005 3e-005 2.9e-005

    and does the same thing. Note I've used curly brackets ({}) to indicate repetition; see perlretut for more info.

      Ah, I see -- thanks! The author of the article works for google, and he wrote software for them that would accept regexp from random users on the web, so he wanted something that would always take linear time. (OTOH, seems like he could simply use any random regexp engine and put a time limit on how long it was allowed to tun.)
Re: regexp puzzle
by syphilis (Archbishop) on Jan 06, 2011 at 00:50 UTC
    Here's a demo script for the case n = 26:
    use warnings; use strict; my $x = 'a' x 26; if($x =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaa +aaaaaaaaaaaaaaaaaa/) {print "ok\n"} else {print "not ok\n"}
    That takes a few seconds to run on my box. As you increase the value of n, the time taken increases markedly. (Each time you increment n, you also have to prepend an 'a?' to the beginning of the regex, and append an 'a' to the end.)

    Cheers,
    Rob
Re: regexp puzzle
by scorpio17 (Canon) on Jan 06, 2011 at 15:30 UTC
    You need to read Mastering Regular Expressions. Especially Chapter 4: The Mechanics of Expression Processing. There are basically two major types of regex engines: NFA and DFA. Each type has its own strengths and weaknesses. Perl uses NFA, because it offers the most capabilities (think of it as the more "general purpose, one size fits all" solution). It's one big weakness is that it is susceptible the excessive backtracking if you don't form the expression properly. Understanding the inner workings of the engine helps a LOT when it comes to actually writing a regex that does what you want efficiently. But it's not hard to craft a regex that intentionally exploits a known weakness in a particular engine. I think that putting a text box on a web page and letting users enter their own regex is actually a HUGE security risk. It would be much better to generate a few canned expressions for common things like "begins with", "ends with", "contains", etc... and then allow the user to enter some alphanumeric text (and you'll have to filter out any non-alphanumerics), pick a "match recipe", and then generate a sane regex behind the scenes.

    You could, for this particular example, use a DFA engine instead. It would be faster (due to no backtracking), but much more limited in the types of regexs you could write. Perl's regex engine is one of its main strengths; you wouldn't want to reduce it's capabilities.

    A regex engine is a pretty complex computer science topic. Putting an "anything goes" text box on a webpage and making it idiot proof is a pretty difficult problem. I would think google would have smarter developers on staff... but maybe that explains why my search results have been so poor lately? ;-)

Re: regexp puzzle
by BrowserUk (Patriarch) on Jan 06, 2011 at 16:19 UTC

    Hm. Despite the authors protestations to the contrary, this seems like a lot of academic hot air.

    The only relatively realistic example I found (scanning), was

    Also, while examples as dramatic as this one rarely occur in practice, less dramatic ones do occur. Examples include using (.*) (.*) (.*) (.*) (.*) to split five space-separated fields

    And despite that no one would realistically use this, it doesn't seem that pathological to me?

    $s = 'the quick brown fox jumps over the lazy dog';; print time; @words = $s =~ m[(.*) (.*) (.*) (.*) (.*) (.*) (.*) (.*) (.*)]; print time;; 1294330312.674 1294330312.67453

    A bit like claiming that all aircraft should be VTOL, because otherwise the time spent taking off and landing makes delivering xmas cards to your neighbours very slow.

    Oh . But you're missing the point. No one delivers xmas cards by plane because it is too slow. But if the were all VTOL, you could if you wanted to.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.