thof has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys,

I'm totally confused... I tried to benchmark two regexes /^(b|a)*$/ and /^(a+|b)*/ against input string "aaaaaaaaab" x <<some big numbers>>. I was sure that the second one is much faster for such string, but when I run my script the results were quite shocking to me (I use similar approach to J. Friedl in his book):

<no of chars x 10 performed 1000 times>;<time in sec - 1st regex>;<tim +e in sec - 2st regex> 200;0.113;0.060 400;0.225;0.112 800;0.448;0.226 1600;0.890;0.763 3200;1.769;1.627 6400;3.539;3.427 12800;7.120;7.215 25600;14.256;58.209 51200;29.847;95.463 102400;65.144;148.634

I don't understand two things. First, why for strings shorter than 128000 chars times are almost the same (IMO the 2nd regex should be faster). Second, why for string larger than 256000 chars the 2nd regex is much slower than the 1st one?

I looked in debug optimizer information, but I didn't find anything particularly interesting (i.e. there are no applicable optimizations for this input string).

It has caused me a headache so I decided to check the same cases in Java. And it seems that regex engine in Java works as I expected:

200;0.145;0.125 400;0.217;0.081 800;0.436;0.165 1600;0.854;0.334 3200;1.83;0.661 6400;4.185;1.327 12800;8.488;2.767 25600;17.791;6.181 51200;37.271;12.856 102400;76.931;24.81

I have no idea why Perl regex engine behaves strangely in this case. If someone could explain it to me I would be grateful.

Perl testing module:
use Time::HiRes 'time'; #~ use re 'debug'; $TimesToDo = 1000; $NoOfChars = 100; $Multiplier = 2; $NoOfExec = 10; $InputString = "aaaaaaaaab"; for($i = 0; $i < $NoOfExec; $i++){ $NoOfChars = $NoOfChars * $Multiplier; $TestString = $InputString x $NoOfChars; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(b|a)*$/; } $EndTime = time(); $Diff = $EndTime - $StartTime; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(a+|b)*$/; } $EndTime = time(); printf("%d;%.3f;%.3f\n", $NoOfChars, $Diff, $EndTime - $StartTime) +; }
Java testing class (it can be executed using command "javac test2.java && java -Xss512m test2"):
import java.util.regex.*; public class test2 { public static void main(String [] args) { Matcher regex1 = Pattern.compile("^(b|a)*$").matcher(""); Matcher regex2 = Pattern.compile("^(a+|b)*$").matcher(""); int timesToDo = 1000; int noOfChars = 100; int multiplier = 2; int noOfExec = 10; String inputString = "aaaaaaaaab"; for(int j = 0; j < noOfExec; j++){ StringBuffer temp = new StringBuffer(); noOfChars = noOfChars * multiplier; for (int i = 0; i < noOfChars; i++) temp.append(inputString); String testString = temp.toString(); long count = timesToDo; long startTime = System.currentTimeMillis(); while (--count > 0) regex1.reset(testString).find(); double diffSeconds = (System.currentTimeMillis() - startTi +me)/1000.0; count = timesToDo; startTime = System.currentTimeMillis(); while (--count > 0) regex2.reset(testString).find(); double seconds = (System.currentTimeMillis() - startTime)/ +1000.0; System.out.println(noOfChars+";"+diffSeconds+";"+seconds); } } }

Btw. I use Manjaro Linux with Perl 5.20.1.

Replies are listed 'Best First'.
Re: Strange behaviour of the regex engine (pathological protection)
by tye (Sage) on Jan 06, 2015 at 03:16 UTC

    Perl optimizes certain nested loop regex constructs similar to (...*)* in order to avoid what is pathological behavior with most regex engines. For non-pathological cases, there is a small cost to this protection.

    The reason the performance significantly decreases is because that crosses the point that triggers the warning "Complex regular subexpression recursion limit (32766) exceeded" (repeatedly).

    - tye        

      Many thanks! I didn't know about warnings. But I noticed new issue. The problem is that when this warning appears the regex doesn't match... I changed line to print "not matched" if $TestString !~ m/^(a+|b)*$/; and for string 256000 and above regex is not able to match. Also when I use possessive quantifiers (m/^(a++|b)*+$/) performance improves a lot but still it does not match for strings larger than 256000.

      I don't understand regex engine in Perl... I also checked the same regexes for Python and PHP (PCRE) and they work similar to Java. I doubt that this is bug in Perl and I cannot match big input strings. There must be something I don't know.

        If I were you, I would file this (the failure case) as a bug against Perl. If nothing else, it might lead to p5p giving you better information than I have off the top of my head.

        Best case, this 32k limit might be removed (a limit that seemed more reasonable when it was implemented but seems increasingly small as time passes).

        - tye        

        Many thanks! I didn't know about warnings. But I noticed new issue. The problem is that when this warning appears the regex doesn't match...
        It's not a new issue it's the same issue :) The recursion depth is limited to prevent catastrophic backtracking (mkay I think it's not actually recursion but it's called that in the docs). Doesn't Friedl describe that in his book? Add use diagnostics; to your program.
Re: Strange behaviour of the regex engine
by Anonymous Monk on Jan 06, 2015 at 01:11 UTC
    Which version of the book are you reading? I would hope this is something explored in the book

      Yes the benchmarking chapter talks about it as does Benchmarks aren't everything ... as least thats my impression

      FWIW, I see similar times with 2008 perl-v5.8.9 and 2012 perl-5.16.1

        I'm reading 3rd version of this book. I think this issue has not been mentioned in the book. Anyway I use two simple techniques (alternation order and plus to prevent backtracking when parentheses are leaved for star) to optimize first regex.

        I read mentioned topic "Benchmarks aren't everything" but I think that it is not the source of problem in this case.