Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I am a monk in training, and I have a regular expression that matches any text a C++ compiler would consider a token, which I use for counting lines of code vs. comments. Matching against the regex itself works; matching against one or more repetitions of it (the regex followed by "+") breaks. The conditions for a token string for my purposes are:
EITHER:
OR:
Having somewhat of a background in computing theory, I used state machines to derive "pure" regular expressions (i.e. no lookahead, stateful matching, or anything like that) and ended up with this for a token:
(((\?+|(\?\/|\/)(\?\/)*\?*)|([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*((\?+|(\/|\?\/)(\?\/)*\?*)?))|((\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))|(\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))))
And as far as I can tell, it works. I store this horrible thing in $token and match m/^$token$/, printing "match" if it matches and "mismatch" if it doesn't:
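    # Read lines from STDIN (or files named on the command line) and
    # report whether each whole line is a single token.
    while( $line = <> ) {
        if( $line =~ m/^$token$/ ) {
            print "match\n";
        } else {
            print "mismatch\n";
        }
    }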
Sample output:
    x            match
    /*x*/        mismatch
    "x"          match
    "\"          mismatch
    "\""         match
    "/*x*/"      match
    "x"/*x*/     mismatch
    /*x*/x       mismatch
    x/*x*/       mismatch
    //x          mismatch
    x="2";       mismatch
All of these are expected behavior. The x="2"; is actually three tokens in sequence. (I did this to get around the fact that /* and // can occur within strings.) However, if I instead match m/^$token+$/ then comments are erroneously matched.
    (blank)      mismatch
    x            match
    x="2";       match
    /*x*/        match
    //x          match
I cannot figure out why this happens. The parentheses in the regular expression above are balanced, and the entire $token variable is enclosed in a single outer group. I have considered that something like (regex1)|(regex2)+ might be happening, but according to my parenthesis counting that is not the case.
Please help! This task is maddening enough as it is. Is there some size limit on regexes? If anyone suspects a bug, I am using Perl 5.005_03 built for PARISC1.1.
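One possible explanation for the m/^$token+$/ behavior does not involve grouping or a size limit at all: with "+", the line only has to be some sequence of tokens, so a comment can match if it can be cut into pieces that are each valid tokens. The sketch below illustrates this with a deliberately tiny, hypothetical pattern (it is NOT the pattern from the post) in which a lone "/" is a token and a run of other characters with an optional trailing "/" is a token. The names $text and @pieces are made up for the illustration.

```perl
use strict;

# Hypothetical toy pattern, NOT the pattern from the post:
#   - a lone "/" is a token, or
#   - a run of non-slash, non-space characters, optionally followed by "/".
my $token = qr{(?:/|[^/\s]+/?)};

my $text = '/*x*/';

# As a single token, the comment is rejected: no one alternative
# covers the whole string.
print "single: ", ($text =~ /^$token$/  ? "match" : "mismatch"), "\n";

# As a sequence of one or more tokens, it is accepted.
print "plus:   ", ($text =~ /^$token+$/ ? "match" : "mismatch"), "\n";

# \G anchors each attempt where the previous one ended, so this lists
# one way the string can be cut into consecutive tokens.
my @pieces = $text =~ /\G($token)/g;
print "pieces: ", join(' | ', @pieces), "\n";
```

With this toy pattern the comment splits into "/" followed by "*x*/". Whether the full pattern behaves the same way can be checked by feeding those two pieces, one per line, to the m/^$token$/ loop above: if both report "match", then the "+" version is doing exactly what it was asked to do, and rejecting comments would need something beyond repeating the single-token pattern.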
Edited 2001-05-23 by Ovid
Replies are listed 'Best First'.

Re: really large regex misbehaving - WTF
by japhy (Canon) on May 22, 2001 at 22:15 UTC

(Ovid) Re: really large regex misbehaving - WTF
by Ovid (Cardinal) on May 22, 2001 at 22:12 UTC

Re: really large regex misbehaving - WTF
by japhy (Canon) on May 22, 2001 at 22:52 UTC
    by Anonymous Monk on May 22, 2001 at 23:41 UTC
    by japhy (Canon) on May 23, 2001 at 00:50 UTC

Re: really large regex misbehaving - WTF
by Anonymous Monk on May 23, 2001 at 00:34 UTC