Hello guys,

I'm totally confused... I tried to benchmark two regexes /^(b|a)*$/ and /^(a+|b)*/ against input string "aaaaaaaaab" x <<some big numbers>>. I was sure that the second one is much faster for such string, but when I run my script the results were quite shocking to me (I use similar approach to J. Friedl in his book):

<no of chars x 10 performed 1000 times>;<time in sec - 1st regex>;<tim +e in sec - 2st regex> 200;0.113;0.060 400;0.225;0.112 800;0.448;0.226 1600;0.890;0.763 3200;1.769;1.627 6400;3.539;3.427 12800;7.120;7.215 25600;14.256;58.209 51200;29.847;95.463 102400;65.144;148.634

I don't understand two things. First, why for strings shorter than 128000 chars times are almost the same (IMO the 2nd regex should be faster). Second, why for string larger than 256000 chars the 2nd regex is much slower than the 1st one?

I looked in debug optimizer information, but I didn't find anything particularly interesting (i.e. there are no applicable optimizations for this input string).

It has caused me a headache so I decided to check the same cases in Java. And it seems that regex engine in Java works as I expected:

200;0.145;0.125 400;0.217;0.081 800;0.436;0.165 1600;0.854;0.334 3200;1.83;0.661 6400;4.185;1.327 12800;8.488;2.767 25600;17.791;6.181 51200;37.271;12.856 102400;76.931;24.81

I have no idea why Perl regex engine behaves strangely in this case. If someone could explain it to me I would be grateful.

Perl testing module:
use Time::HiRes 'time'; #~ use re 'debug'; $TimesToDo = 1000; $NoOfChars = 100; $Multiplier = 2; $NoOfExec = 10; $InputString = "aaaaaaaaab"; for($i = 0; $i < $NoOfExec; $i++){ $NoOfChars = $NoOfChars * $Multiplier; $TestString = $InputString x $NoOfChars; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(b|a)*$/; } $EndTime = time(); $Diff = $EndTime - $StartTime; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(a+|b)*$/; } $EndTime = time(); printf("%d;%.3f;%.3f\n", $NoOfChars, $Diff, $EndTime - $StartTime) +; }
Java testing class (it can be executed using command "javac test2.java && java -Xss512m test2"):
import java.util.regex.*; public class test2 { public static void main(String [] args) { Matcher regex1 = Pattern.compile("^(b|a)*$").matcher(""); Matcher regex2 = Pattern.compile("^(a+|b)*$").matcher(""); int timesToDo = 1000; int noOfChars = 100; int multiplier = 2; int noOfExec = 10; String inputString = "aaaaaaaaab"; for(int j = 0; j < noOfExec; j++){ StringBuffer temp = new StringBuffer(); noOfChars = noOfChars * multiplier; for (int i = 0; i < noOfChars; i++) temp.append(inputString); String testString = temp.toString(); long count = timesToDo; long startTime = System.currentTimeMillis(); while (--count > 0) regex1.reset(testString).find(); double diffSeconds = (System.currentTimeMillis() - startTi +me)/1000.0; count = timesToDo; startTime = System.currentTimeMillis(); while (--count > 0) regex2.reset(testString).find(); double seconds = (System.currentTimeMillis() - startTime)/ +1000.0; System.out.println(noOfChars+";"+diffSeconds+";"+seconds); } } }

Btw. I use Manjaro Linux with Perl 5.20.1.


In reply to Strange behaviour of the regex engine by thof

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.