in reply to Regex weirdness?

I've benchmarked four regexes. The first (orig) is the one you claimed to be your original, which actually isn't quite correct, as it doesn't handle backslashes correctly. The second (hv) is Hugo's suggestion. The third (unroll) uses the standard unrolling technique, with separate cases for double and single quotes. The last (code) also uses unrolling, but handles single and double quotes with a single case.

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw /cmpthese/;

our $orig   = qr {( [^"']+ | (?:"(?:[^"]|\\")*") | (?:'(?:[^']|\\')*') )}xs;
our $hv     = qr {( [^"']+ | (["']) (?: \\ . | (?!\2) . )* \2 )}xs;
our $unroll = qr {( [^"']+ | " [^"]* (?: \\. [^"]*)* " | ' [^']* (?: \\. [^']*)* ' )}xs;
our $code   = qr {( [^"']+ | (["']) (??{ "[^$2]*(?:\\\\.[^$2]*)*" }) \2 )}xs;

our $str = `cat /tmp/pp.c`;   # pp.c from the perl 5.8.5 sources.

cmpthese(-10, {
    orig   => 'while ($str =~ /$orig/g)   {1}',
    hv     => 'while ($str =~ /$hv/g)     {1}',
    unroll => 'while ($str =~ /$unroll/g) {1}',
    code   => 'while ($str =~ /$code/g)   {1}',
});

__END__
         Rate   orig     hv   code unroll
orig   13.0/s     --    -3%   -92%   -96%
hv     13.4/s     3%     --   -92%   -96%
code    166/s  1179%  1139%     --   -52%
unroll  344/s  2546%  2464%   107%     --

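Note the huge benefits of unrolling: unroll runs at 344/s against 13.0/s for orig, more than 26 times as fast, and even code manages 166/s.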
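
For anyone who hasn't met the technique, here is a minimal sketch of what "unrolling" means for a double-quoted string: the pattern takes the shape normal* (special normal*)*, so the regex engine consumes runs of ordinary characters in one go instead of trying an alternation at every character. This is only an illustration, not the benchmarked code. It assumes the usual textbook form where the "normal" class excludes the backslash as well as the quote (the $unroll above uses [^"] instead), and the sample string is made up rather than taken from pp.c.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Naive form: an alternation inside a star, tried at every character.
my $naive    = qr{ " (?: \\. | [^"\\] )* " }xs;

# Unrolled form: normal* (special normal*)*
#   normal  = [^"\\]   anything but a quote or a backslash
#   special = \\.      a backslash plus the character it escapes
my $unrolled = qr{ " [^"\\]* (?: \\. [^"\\]* )* " }xs;

# Hypothetical sample input (an assumption for this sketch).
my $sample = q{x = "a \"quoted\" word"; y = "plain"; z = "tail\\\\";};

# Collect every quoted string each pattern finds.
my @naive    = $sample =~ /($naive)/g;
my @unrolled = $sample =~ /($unrolled)/g;

print "$_\n" for @unrolled;
```

Both patterns should find the same strings; the unrolled one just gets there with far less backtracking bookkeeping, which is where the speed-up in a benchmark like the one above comes from.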