The slowness is actually caused by the regex itself. The way your regex is constructed, the regex engine may have to do a huge amount of backtracking before it either finds a match or gives up.

 

Here's a similar example that demonstrates the same problem:

    qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("(\w+| )*")/

The (\w+| )* part can match the word 'lazy' in many ways: ('lazy'), or ('laz', 'y'), or ('la', 'zy'), or ('la', 'z', 'y'), or... Each time the regex engine gets to the newline and fails to match the second quote, it backtracks and tries another way of matching the words. It's the nested quantifiers that get you.
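To get a feel for how quickly those alternatives pile up, here is a small sketch that lists every way a word can be cut into consecutive non-empty pieces, which is what repeated \w+ iterations inside (\w+| )* are free to do. The helper name splits is made up for this illustration; it is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: return every way to cut $word into consecutive
# non-empty pieces, one array reference per way.
sub splits {
    my ($word) = @_;
    return ([]) if $word eq '';          # one way to split nothing: no pieces
    my @ways;
    for my $len (1 .. length $word) {
        my $head = substr($word, 0, $len);
        # Prepend this first piece to every split of the remainder.
        push @ways, [ $head, @$_ ] for splits(substr($word, $len));
    }
    return @ways;
}

for my $w (qw(dog lazy quick)) {
    my @ways = splits($w);
    printf "%-5s %2d ways\n", $w, scalar @ways;
}

# Show the individual splits for one word.
print join(', ', map { "('" . join("', '", @$_) . "')" } splits('lazy')), "\n";
```

A word of n letters can be split in 2**(n-1) ways, so the count doubles with every extra letter. The nine words in the example sentence have 35 letters between them, which gives 2**26 (about 67 million) ways to carve up the whole sentence before the engine can conclude that the closing quote is not there.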

The solution is to restructure the regex so that it can only match a part of the string in a limited number of ways, to eliminate all the useless backtracking. (Very easy in this case, since the regex is so simple.)

    qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("[\w ]*")/
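As a quick sanity check that the rewrite does not change what gets matched, the following sketch runs both patterns over a few short strings and prints what each one captures. The helper quoted and the sample strings are invented for this illustration, and the strings are kept short on purpose so the nested version stays fast on any perl.

```perl
use strict;
use warnings;

my $nested = qr/("(\w+| )*")/;   # nested quantifiers, as in the slow example
my $flat   = qr/("[\w ]*")/;     # one character class, one quantifier

# Hypothetical helper: return the quoted text a pattern captures, or undef.
sub quoted {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $text (qq{say "lazy dog" now}, qq{"lazy dog\n"}, qq{""}) {
    my ($n, $f) = map { quoted($_, $text) } $nested, $flat;
    # Replace undef with a label so the output is readable.
    $_ = defined $_ ? $_ : 'no match' for $n, $f;
    print "nested: $n | flat: $f\n";
}
```

Both patterns should agree on every input; the only difference is how much work the engine does when the closing quote is missing.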

 

This is what you did when you moved the space inside the character class and removed the nested quantifiers. Here's one way to fix your regex, without changing the semantics:

    (?:\w[\.\w\-\'\!\(\)\/]* +)*\w[\.\w\-\'\!\(\)\/]*

Each iteration of (?:\w[\.\w\-\'\!\(\)\/]* +)* has to match at least one word character, followed by at least one space. Because a space can never be part of a word and every word must start with a word character, there's only one way for this regex to match any given stretch of text.
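Here is a minimal sketch of that unrolled pattern in use, built from two pieces so the structure is easier to see. The variable names $word and $phrase, the helper is_phrase, and the sample strings are assumptions made for this illustration; only the pattern itself comes from the text above.

```perl
use strict;
use warnings;

# One "word": a word character followed by any run of word or punctuation characters.
my $word = qr/\w[\.\w\-\'\!\(\)\/]*/;

# Unrolled form: zero or more "word, then one or more spaces", then a final word.
my $phrase = qr/(?:$word +)*$word/;

# Hypothetical helper: true if the whole string is a sequence of such words.
sub is_phrase {
    my ($text) = @_;
    return $text =~ /\A$phrase\z/ ? 1 : 0;
}

for my $text ('hello world', 'rock/roll  now!', 'trailing space ', '(paren first') {
    printf "%-18s %s\n", "'$text'", is_phrase($text) ? 'matches' : 'no match';
}
```

The first two strings match as a whole; the last two do not, because a trailing space has no word after it and a leading parenthesis is not a word character. In each case the engine has only one candidate path to try, so a failure is detected without repeated backtracking.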

 

As perl's regex engine has been improved, various optimizations have been added to avoid this exponential backtracking problem. That's probably why your code ran so much faster on Unix; I expect you were using 5.6.0 or 5.6.1 there. My simple example shows the same behavior, returning immediately in 5.6.1 and taking a loooong time to finish in 5.005_03.

Jeffrey Friedl discusses this technique, which he calls "unrolling the loop", in Mastering Regular Expressions.


In reply to Re: Speed of regex on compiled perl under windows by chipmunk
in thread Speed of regex on compiled perl under windows by blkstrat
