Re: why this regular expression is so slow?

I'm assuming the two regexes really are:
slow:
/(\S)[\.-]*(\S)[\.-]*(\S)[\.-]*(\S)[\.-]*$/s
and fast:
/.*(\w)[\.-]*(\w)[\.-]*(\w)[\.-]*(\w)[\.-]*/s

After staring for a while at the string you're trying to match against, I'm also assuming that it has no spaces in it, so \S (a non-whitespace character) matches any single character in the string.

Here's my intepretation of the slow regex:
/(\S)[\.-]*(\S)[\.-]*(\S)[\.-]*(\S)[\.-]*$/s
The regex is anchored at the end of the string, but the regex engine (RE) looks for the leftmost, longest match, so it has to start looking at the beginning of the string. It looks for anything, zero or more dots or dashes, anything, zero or more dots or dashes, anything, zero or more dots or dashes, anything, zero or more dots or dashes, and, having found a substring matching this, it looks for the end of the string. If the RE fails to find the end of the string at that point, it starts to backtrack: it gives up every possible variation of dots, dashes or anything and then checks again that the end of the string is at the end of this variation. The RE continues to do this until it exhausts every possible variation that can be extracted from the original substring because none of these, I'm guessing, will also match with the $ end-of-string assertion. Then the RE advances the 'match point' by one character position to the second character in the string and tries the whole thing over again. You have entered Backtrack Hell; you are lucky it only takes you a few hours to get out of it.

Likewise, the fast regex:
/.*(\w)[\.-]*(\w)[\.-]*(\w)[\.-]*(\w)[\.-]*/s
The first thing to note is that you are matching \w (a 'word' character) instead of \S (effectively, anything), and that the characters matched by \w and [\.-] are mutually exclusive. This is an enormous help to the poor RE. The other thing to note is that the regex begins with a .* that immediately consumes every character to the end of the string before the RE begins backtracking from the end to try to find a pattern not anchored at the end of the string: another huge help. After a reasonable amount of backtracking, a match is found (if my visual inspection of the string is correct).

A small point: Since the . metacharacter ('any character') is not used in the regex, the //s regex modifier has no effect, although it does no harm.

Comment on Re: why this regular expression is so slow? Select or Download Code

Replies are listed 'Best First'.
Re^2: why this regular expression is so slow? by fnever (Initiate) on Jul 16, 2009 at 00:35 UTC
this is really helpful,thanks	[reply]