Strange behaviour of the regex engine

thof has asked for the wisdom of the Perl Monks concerning the following question:

Hello guys,

I'm totally confused... I tried to benchmark two regexes /^(b|a)*$/ and /^(a+|b)*/ against input string "aaaaaaaaab" x <<some big numbers>>. I was sure that the second one is much faster for such string, but when I run my script the results were quite shocking to me (I use similar approach to J. Friedl in his book):

<no of chars x 10 performed 1000 times>;<time in sec - 1st regex>;<tim
+e in sec - 2st regex>
200;0.113;0.060
400;0.225;0.112
800;0.448;0.226
1600;0.890;0.763
3200;1.769;1.627
6400;3.539;3.427
12800;7.120;7.215
25600;14.256;58.209
51200;29.847;95.463
102400;65.144;148.634
[download]

I don't understand two things. First, why for strings shorter than 128000 chars times are almost the same (IMO the 2nd regex should be faster). Second, why for string larger than 256000 chars the 2nd regex is much slower than the 1st one?

I looked in debug optimizer information, but I didn't find anything particularly interesting (i.e. there are no applicable optimizations for this input string).

It has caused me a headache so I decided to check the same cases in Java. And it seems that regex engine in Java works as I expected:

200;0.145;0.125
400;0.217;0.081
800;0.436;0.165
1600;0.854;0.334
3200;1.83;0.661
6400;4.185;1.327
12800;8.488;2.767
25600;17.791;6.181
51200;37.271;12.856
102400;76.931;24.81
[download]

I have no idea why Perl regex engine behaves strangely in this case. If someone could explain it to me I would be grateful.

Perl testing module:

use Time::HiRes 'time';
#~ use re 'debug';

$TimesToDo = 1000;
$NoOfChars = 100;
$Multiplier = 2;
$NoOfExec = 10;
$InputString = "aaaaaaaaab";

for($i = 0; $i < $NoOfExec; $i++){
    $NoOfChars = $NoOfChars * $Multiplier;
    $TestString = $InputString x $NoOfChars;
    $Count = $TimesToDo;
    $StartTime = time();
    while ($Count-- > 0) {
        $TestString =~ m/^(b|a)*$/;
    }
    $EndTime = time();
    $Diff = $EndTime - $StartTime;

    $Count = $TimesToDo;
    $StartTime = time();
    while ($Count-- > 0) {
        $TestString =~ m/^(a+|b)*$/;
    }
    $EndTime = time();
    printf("%d;%.3f;%.3f\n", $NoOfChars, $Diff, $EndTime - $StartTime)
+;
}
[download]

Java testing class (it can be executed using command "javac test2.java && java -Xss512m test2"):

import java.util.regex.*;
public class test2 {
    public static void main(String [] args)
    {
        Matcher regex1 = Pattern.compile("^(b|a)*$").matcher("");
        Matcher regex2 = Pattern.compile("^(a+|b)*$").matcher("");
        int timesToDo = 1000;
        int noOfChars = 100;
        int multiplier = 2;
        int noOfExec = 10;
        String inputString = "aaaaaaaaab";

        for(int j = 0; j < noOfExec; j++){
            StringBuffer temp = new StringBuffer();
            noOfChars = noOfChars * multiplier;
            for (int i = 0; i < noOfChars; i++)
                temp.append(inputString);
            String testString = temp.toString();

            long count = timesToDo;
            long startTime = System.currentTimeMillis();
            while (--count > 0)
                regex1.reset(testString).find();
            double diffSeconds = (System.currentTimeMillis() - startTi
+me)/1000.0;

            count = timesToDo;
            startTime = System.currentTimeMillis();
            while (--count > 0)
                regex2.reset(testString).find();
            double seconds = (System.currentTimeMillis() - startTime)/
+1000.0;
            System.out.println(noOfChars+";"+diffSeconds+";"+seconds);
        }
    }
}
[download]

Btw. I use Manjaro Linux with Perl 5.20.1.

Comment on Strange behaviour of the regex engine Select or Download Code

Replies are listed 'Best First'.
Re: Strange behaviour of the regex engine (pathological protection) by tye (Sage) on Jan 06, 2015 at 03:16 UTC
Perl optimizes certain nested loop regex constructs similar to (...) in order to avoid what is pathological behavior with most regex engines. For non-pathological cases, there is a small cost to this protection. The reason the performance significantly decreases is because that crosses the point that triggers the warning "Complex regular subexpression recursion limit (32766) exceeded" (repeatedly). - tye	[reply]
Re^2: Strange behaviour of the regex engine (pathological protection) by thof (Initiate) on Jan 06, 2015 at 18:53 UTC
Many thanks! I didn't know about warnings. But I noticed new issue. The problem is that when this warning appears the regex doesn't match... I changed line to `print "not matched" if $TestString !~ m/^(a+\|b)$/;` and for string 256000 and above regex is not able to match. Also when I use possessive quantifiers (m/^(a++\|b)+$/) performance improves a lot but still it does not match for strings larger than 256000. I don't understand regex engine in Perl... I also checked the same regexes for Python and PHP (PCRE) and they work similar to Java. I doubt that this is bug in Perl and I cannot match big input strings. There must be something I don't know.	[reply] [d/l]
Re^3: Strange behaviour of the regex engine (bug) by tye (Sage) on Jan 06, 2015 at 21:46 UTC
If I were you, I would file this (the failure case) as a bug against Perl. If nothing else, it might lead to p5p giving you better information than I have off the top of my head. Best case, this 32k limit might be removed (a limit that seemed more reasonable when it was implemented but seems increasingly small as time passes). - tye	[reply]
Re^3: Strange behaviour of the regex engine (pathological protection) by Anonymous Monk on Jan 06, 2015 at 21:52 UTC
Many thanks! I didn't know about warnings. But I noticed new issue. The problem is that when this warning appears the regex doesn't match... It's not a new issue it's the same issue :) The recursion depth is limited to prevent catastrophic backtracking (mkay I think it's not actually recursion but it's called that in the docs). Doesn't Friedl describe that in his book? Add `use diagnostics;` to your program.	[reply] [d/l]
Re: Strange behaviour of the regex engine by Anonymous Monk on Jan 06, 2015 at 01:11 UTC
Which version of the book are you reading? I would hope this is something explored in the book	[reply]
Re^2: Strange behaviour of the regex engine ( Benchmarks aren't everything ) by Anonymous Monk on Jan 06, 2015 at 01:45 UTC
Yes the benchmarking chapter talks about it as does Benchmarks aren't everything ... as least thats my impression FWIW, I see similar times with 2008 perl-v5.8.9 and 2012 perl-5.16.1	[reply]
Re^3: Strange behaviour of the regex engine ( Benchmarks aren't everything ) by thof (Initiate) on Jan 06, 2015 at 19:04 UTC
I'm reading 3rd version of this book. I think this issue has not been mentioned in the book. Anyway I use two simple techniques (alternation order and plus to prevent backtracking when parentheses are leaved for star) to optimize first regex. I read mentioned topic "Benchmarks aren't everything" but I think that it is not the source of problem in this case.	[reply]
Re^2: Strange behaviour of the regex engine by Anonymous Monk on Jan 06, 2015 at 01:11 UTC
Jeffrey Friedl's Mastering Regular Expressions	[reply]