why this regular expression is so slow?

fnever has asked for the wisdom of the Perl Monks concerning the following question:

perl -e '$a="ATEH...GYS....I..CVNV...FGQ.....RTY.....F.YC.EY....G....D
+....TE....RL...R..E....V.IA..SVG.E.L-.-.---..-----...................
+...........................................................VPE.p-RTP.
+Y.A..M.S..V.TPA.TK---------.......-........---TS........IYGY.GT.R.P..
+......VP.......D...LQCVSISNWSMAR.......................KI..GE.Y...L..
+LE...-....--..-.....-..--........--...R..GF.P....V..Y................
+.....................................................................
+..............-...-..................................................
+....-.--....---EARV.D.P.L.TRL..V....IDR..RIT.T.............F.........
+..GW.............CS........V.N.......RY..--.----.......-....-.--.....
+.....................................................-...............
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+................-.-.---DW.RH..Q.G...R....S..........ST...C.....D.....
+IEV..D....CD.V.S...DL.V...A..V....P..D..DN.SW.-.......--..-P.L..YRC..
+L.S.F........DI.EC.....M..S..A.D......GG...---FPCAEKP...D.DIV........
+.I.QI..S.CV.CY.ET...G--.----..-Ggsadgaggvgvnggatpaptqshh.............
+.....................................................................
+........................gngesggipyypqaatagvgaqppVFG....R..SG..L......
+.H.....L......F.TI..GT.CDPVG--.....---..-.-...-..-.-----.............
+...---.----...........................EADV..Y..EF--.P....SEY.EL.LLGFM
+L.FFQRY.A..PSFMTGYNI.N...S..FDLKYILTR..L.E......Y..............L-.-.-
+..-..-..--...Y..K........................L.DV.QR.F..C.KVPV..AQGG....R
+FFLHSPT---..-...-LYrRH.Q.....D..RAS.FaaaA.A.NA..P..T.K...............
+V.....Y........IS..G..CV.VVD.MY.PVCM....A....KT.--......N...SP.N.YK.L
+N";if($a=~/(\S)&#91;\.-&#93;*(\S)&#91;\.-&#93;*(\S)&#91;\.-&#93;*(\S)
+&#91;\.-&#93;*$/s){print "$1$2$3$4\n";}'
[download]

this simple command line takes hours to finish...does anyone have a better solution?

ok,i have a better solution,but why?...

perl -e '$a="ATEH...GYS....I..CVNV...FGQ.....RTY.....F.YC.EY....G....D
+....TE....RL...R..E....V.IA..SVG.E.L-.-.---..-----...................
+...........................................................VPE.p-RTP.
+Y.A..M.S..V.TPA.TK---------.......-........---TS........IYGY.GT.R.P..
+......VP.......D...LQCVSISNWSMAR.......................KI..GE.Y...L..
+LE...-....--..-.....-..--........--...R..GF.P....V..Y................
+.....................................................................
+..............-...-..................................................
+....-.--....---EARV.D.P.L.TRL..V....IDR..RIT.T.............F.........
+..GW.............CS........V.N.......RY..--.----.......-....-.--.....
+.....................................................-...............
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+.....................................................................
+................-.-.---DW.RH..Q.G...R....S..........ST...C.....D.....
+IEV..D....CD.V.S...DL.V...A..V....P..D..DN.SW.-.......--..-P.L..YRC..
+L.S.F........DI.EC.....M..S..A.D......GG...---FPCAEKP...D.DIV........
+.I.QI..S.CV.CY.ET...G--.----..-Ggsadgaggvgvnggatpaptqshh.............
+.....................................................................
+........................gngesggipyypqaatagvgaqppVFG....R..SG..L......
+.H.....L......F.TI..GT.CDPVG--.....---..-.-...-..-.-----.............
+...---.----...........................EADV..Y..EF--.P....SEY.EL.LLGFM
+L.FFQRY.A..PSFMTGYNI.N...S..FDLKYILTR..L.E......Y..............L-.-.-
+..-..-..--...Y..K........................L.DV.QR.F..C.KVPV..AQGG....R
+FFLHSPT---..-...-LYrRH.Q.....D..RAS.FaaaA.A.NA..P..T.K...............
+V.....Y........IS..G..CV.VVD.MY.PVCM....A....KT.--......N...SP.N.YK.L
+N";if($a=~/.*(\w)&#91;\.-&#93;*(\w)&#91;\.-&#93;*(\w)&#91;\.-&#93;*(\
+w)&#91;\.-&#93;*/s){print "$1$2$3$4\n";}'
[download]

Comment on why this regular expression is so slow? Select or Download Code

Replies are listed 'Best First'.
Re: why this regular expression is so slow? by ELISHEVA (Prior) on Jul 15, 2009 at 07:30 UTC
Simply changing "\S" to "\w" converts a regex that takes forever to an instant match (at least this is so on my machine - Debian Etch, Perl 5.8.8). The reason is that `\S` and `[\.-]` can both match "." and "-". Without a distinctive character to match on, each time a match fails, the regex engine needs to backtrack and try the other regex. And there are a lot of "." and "-" to provide opportunities for backtracking. Additionally, because `[\.-]` is greedy and can match an indefinite number of elements, the regex engine may have to process a large portion of a very long string before it discovers that a match has failed. Changing `\S` to `\w` results in a mutually exclusive match, thus eliminating the majority of backtracking and producing an instant match. Best, beth Note: for those who don't want to hunt down an ASCII table `[\.-]` is `[\.-]`.	[reply] [d/l] [select]
Re: why this regular expression is so slow? by moritz (Cardinal) on Jul 15, 2009 at 07:13 UTC
It's slow because the quantified char classes can match a zero-width string, thus causing a massive amount of backtracking. If you give us a verbal description of what you want to achieve, maybe we can come up with a better solution that involves less backtracking.	[reply]
Re: why this regular expression is so slow? by AnomalousMonk (Archbishop) on Jul 15, 2009 at 07:52 UTC
I'm assuming the two regexes really are: slow: `/(\S)[\.-](\S)[\.-](\S)[\.-](\S)[\.-]$/s` and fast: `/.(\w)[\.-](\w)[\.-](\w)[\.-](\w)[\.-]/s` After staring for a while at the string you're trying to match against, I'm also assuming that it has no spaces in it, so `\S` (a non-whitespace character) matches any single character* in the string. Here's my intepretation of the slow regex: `/(\S)[\.-](\S)[\.-](\S)[\.-](\S)[\.-]$/s` The regex is anchored at the end of the string, but the regex engine (RE) looks for the leftmost, longest match, so it has to start looking at the beginning of the string. It looks for anything, zero or more dots or dashes, anything, zero or more dots or dashes, anything, zero or more dots or dashes, anything, zero or more dots or dashes, and, having found a substring matching this, it looks for the end of the string. If the RE fails to find the end of the string at that point, it starts to backtrack: it gives up every possible variation of dots, dashes or anything and then checks again that the end of the string is at the end of this variation. The RE continues to do this until it exhausts every possible variation that can be extracted from the original substring because none of these, I'm guessing, will also match with the `$` end-of-string assertion. Then the RE advances the 'match point' by one character position to the second character in the string and tries the whole thing over again. You have entered Backtrack Hell; you are lucky it only takes you a few hours to get out of it. Likewise, the fast regex: `/.(\w)[\.-](\w)[\.-](\w)[\.-](\w)[\.-]/s` The first thing to note is that you are matching `\w` (a 'word' character) instead of `\S` (effectively, anything), and that the characters matched by `\w` and `[\.-]` are mutually exclusive. This is an enormous help to the poor RE. The other thing to note is that the regex begins with a `.` that immediately consumes every character to the end of the string before the RE begins backtracking from the end to try to find a pattern not anchored at the end of the string: another huge help. After a reasonable amount of backtracking, a match is found (if my visual inspection of the string is correct). A small point: Since the `.` metacharacter ('any character') is not used in the regex, the `//s` regex modifier has no effect, although it does no harm.	[reply] [d/l] [select]
Re^2: why this regular expression is so slow? by fnever (Initiate) on Jul 16, 2009 at 00:35 UTC
this is really helpful,thanks	[reply]
Re: why this regular expression is so slow? by CountZero (Bishop) on Jul 15, 2009 at 06:02 UTC
None of these examples "work", since you have included htlm entities (`]`, ...) in your regex. Please edit your code so it contains the real characters you want to match, there is no need to encode the characters within the `<code> ... </code>`-tags. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re^2: why this regular expression is so slow? by AnomalousMonk (Archbishop) on Jul 15, 2009 at 06:47 UTC
CounterZero++ And BTW, fnever: what's with the long string of `<BR />`s in the middle of the OP? Please Update that.	[reply] [d/l]
Re: why this regular expression is so slow? by JavaFan (Canon) on Jul 15, 2009 at 09:38 UTC
The reason it's slow is that you're anchoring the regexp to the end, but you have a other matches in the string which only fail because they don't end at the end of the string. The obvious answer here is: don't use a regexp. It seems you just want the last four characters in the string that aren't dashes or dots. You could simple remove the dashes and dots from the string, and take the last four characters from the resulting string. (You may need some special handling if the string may contain less than four such characters).	[reply]
Re^2: why this regular expression is so slow? by Corion (Patriarch) on Jul 15, 2009 at 09:44 UTC
Alternatively, sexeger points out the approach of reversing the input and then taking the first four non-dash/dots characters, which should also happen without much backtracking.	[reply]
Re: why this regular expression is so slow? by ikegami (Patriarch) on Jul 15, 2009 at 14:11 UTC
I assume you want `$1`, `$2`, `$3` and `$4` to be something other than "`.`" and "`-`". If so, how about: `perl -le' $_ = "..."; s/[\.-]+//g; if (/(\S)(\S)(\S)(\S)$/) { print "$1$2$3$4" } '` [download] or just `perl -le' $_ = "..."; s/[\.-]+//g; print substr $_, -4; '` [download]	[reply] [d/l] [select]