Hello guys,
I'm totally confused... I tried to benchmark two regexes /^(b|a)*$/ and /^(a+|b)*/ against input string "aaaaaaaaab" x <<some big numbers>>. I was sure that the second one is much faster for such string, but when I run my script the results were quite shocking to me (I use similar approach to J. Friedl in his book):
<no of chars x 10 performed 1000 times>;<time in sec - 1st regex>;<tim +e in sec - 2st regex> 200;0.113;0.060 400;0.225;0.112 800;0.448;0.226 1600;0.890;0.763 3200;1.769;1.627 6400;3.539;3.427 12800;7.120;7.215 25600;14.256;58.209 51200;29.847;95.463 102400;65.144;148.634
I don't understand two things. First, why for strings shorter than 128000 chars times are almost the same (IMO the 2nd regex should be faster). Second, why for string larger than 256000 chars the 2nd regex is much slower than the 1st one?
I looked in debug optimizer information, but I didn't find anything particularly interesting (i.e. there are no applicable optimizations for this input string).
It has caused me a headache so I decided to check the same cases in Java. And it seems that regex engine in Java works as I expected:
200;0.145;0.125 400;0.217;0.081 800;0.436;0.165 1600;0.854;0.334 3200;1.83;0.661 6400;4.185;1.327 12800;8.488;2.767 25600;17.791;6.181 51200;37.271;12.856 102400;76.931;24.81
I have no idea why Perl regex engine behaves strangely in this case. If someone could explain it to me I would be grateful.
Perl testing module:Java testing class (it can be executed using command "javac test2.java && java -Xss512m test2"):use Time::HiRes 'time'; #~ use re 'debug'; $TimesToDo = 1000; $NoOfChars = 100; $Multiplier = 2; $NoOfExec = 10; $InputString = "aaaaaaaaab"; for($i = 0; $i < $NoOfExec; $i++){ $NoOfChars = $NoOfChars * $Multiplier; $TestString = $InputString x $NoOfChars; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(b|a)*$/; } $EndTime = time(); $Diff = $EndTime - $StartTime; $Count = $TimesToDo; $StartTime = time(); while ($Count-- > 0) { $TestString =~ m/^(a+|b)*$/; } $EndTime = time(); printf("%d;%.3f;%.3f\n", $NoOfChars, $Diff, $EndTime - $StartTime) +; }
import java.util.regex.*; public class test2 { public static void main(String [] args) { Matcher regex1 = Pattern.compile("^(b|a)*$").matcher(""); Matcher regex2 = Pattern.compile("^(a+|b)*$").matcher(""); int timesToDo = 1000; int noOfChars = 100; int multiplier = 2; int noOfExec = 10; String inputString = "aaaaaaaaab"; for(int j = 0; j < noOfExec; j++){ StringBuffer temp = new StringBuffer(); noOfChars = noOfChars * multiplier; for (int i = 0; i < noOfChars; i++) temp.append(inputString); String testString = temp.toString(); long count = timesToDo; long startTime = System.currentTimeMillis(); while (--count > 0) regex1.reset(testString).find(); double diffSeconds = (System.currentTimeMillis() - startTi +me)/1000.0; count = timesToDo; startTime = System.currentTimeMillis(); while (--count > 0) regex2.reset(testString).find(); double seconds = (System.currentTimeMillis() - startTime)/ +1000.0; System.out.println(noOfChars+";"+diffSeconds+";"+seconds); } } }
Btw. I use Manjaro Linux with Perl 5.20.1.
In reply to Strange behaviour of the regex engine by thof
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |