blkstrat has asked for the wisdom of the Perl Monks concerning the following question:

O monks: I have a regular expression conundrum that seems to be platform related. The company that I am working for has some scripts that process documents, and in order to ship them for windows, we compile a perl (5.005.03) so that we can qa with a known version. I am noticing that a regular expression which runs very quickly on linux takes considerably longer running on Win 2k. The part of the regular expression which I am looking at changing is this:
(?:\w[\.\w\-\'\!\(\)\/\]|[ ])+

I tried removing the alteration, thusly:
(?:\w[\.\w\-\'\!\(\)\/]+[ ]?)+

but got about the same speed.
However, removing the clustering:
(\w[\.\w\-\'\!\(\)\/ ])+

speeded this up considerably on Win32 and screams on linux. (This is a partial section of a long regex, but this is the section that was changeable).

My question, then, is: Do any monks have experience with speed differences of regular expressions under Windows, or compilation suggestions that would improve things. Thanks in advance, Blkstrat

Replies are listed 'Best First'.
Re: Speed of regex on compiled perl under windows
by chipmunk (Parson) on Nov 21, 2001 at 01:24 UTC
    The slowness is actually an effect of the regex. The way your regex is constructed, the regex engine potentially has to do a lot of backtracking to try to find a match.

     

    Here's a similar example that demonstrates the same problem: qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("(\w+| )*")/ The (\w+| )* part can match the word 'Just' in many ways: ('Just'), or ('Jus', 't'), or ('Ju', 'st'), or ('Ju', 's', 't'), or... Each time the regex engine gets to the newline and fails to match the second quote, it backtracks and tries another way of matching the words. It's the nested quantifiers that get you.

    The solution is to restructure the regex so that it can only match a part of the string in a limited number of ways, to eliminate all the useless backtracking. (Very easy in this case, since the regex is so simple.) qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("[\w ]*")/

     

    This is what you did when you moved the space inside the character class and removed the nested quantifiers. Here's one way to fix your regex, without changing the semantics: (?:\w[\.\w\-\'\!\(\)\/]* +)*\w[\.\w\-\'\!\(\)\/]* Each iteration of (?:\w[\.\w\-\'\!\(\)\/]* +)* has to match at least one word character, followed by at least one space. There's only one way for this regex to match a string.

     

    As perl's regex engine has been improved, various optimizations have been added to avoid this exponential backtracking problem. That's probably why your code ran so much faster on Unix; I expect you were using 5.6.0 or 5.6.1 there. My simple example shows the same behavior, returning immediately in 5.6.1 and taking a loooong time to finish in 5.005_03.

    Jeffrey Friedl discusses this technique, which he calls "unrolling the loop", in Mastering Regular Expressions.