I am a monk in training, and I have a regular expression that matches any text that a C++ compiler would consider a token, for the purpose of counting lines of code vs. comments. Matching against the regex by itself works fine; matching against the regex followed by "+" breaks. The conditions for a token string for my purposes are:
EITHER:
OR:
Having somewhat of a background in computing theory, I used state machines to derive "pure" regular expressions (i.e. no lookahead, stateful matching, or anything like that) and ended up with this for a token:
(((\?+|(\?\/|\/)(\?\/)*\?*)|([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*((\?+|(\/|\?\/)(\?\/)*\?*)?))|((\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))|(\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))))
And as far as I can tell, it works. I store this horrible thing in $token and match m/^$token$/, printing out "match" if it matches and "mismatch" if it doesn't.
    while( $line = <> ) {
        if( $line =~ m/^$token$/ ) {
            print "match\n";
        } else {
            print "mismatch\n";
        }
    }
Sample output:
    x
    match
    /*x*/
    mismatch
    "x"
    match
    "\"
    mismatch
    "\""
    match
    "/*x*/"
    match
    "x"/*x*/
    mismatch
    /*x*/x
    mismatch
    x/*x*/
    mismatch
    //x
    mismatch
    x="2";
    mismatch
All of these are expected behavior. The x="2"; is actually three tokens in sequence. (I did this to get around the fact that /* and // can occur within strings.) However, if I instead match m/^$token+$/ then comments are erroneously matched.
    (blank)
    mismatch
    x
    match
    x="2";
    match
    /*x*/
    match
    //x
    match
I cannot figure out why this happens. The parentheses in the regular expression above are balanced, and the entire $token variable is inside a parenthesized group. I have considered that something like (regex1)|(regex2)+ might be happening, but that is not the case according to my parenthesis counting.
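One way to investigate, sketched here with a deliberately tiny stand-in pattern rather than the real $token: wrap the pattern in a non-capturing group so that precedence is ruled out, then walk the input with \G to see which pieces a repeated match could be using. The names $toy and pieces() are made up for this illustration, and the toy pattern is only an assumption about the kind of thing that could be going on; it says nothing definite about the full regex.

```perl
use strict;

# Hypothetical toy token, NOT the pattern from the post: a lone slash,
# or a run of non-slash, non-space characters with an optional trailing slash.
# The (?: ... ) wrapper guarantees that a following + applies to the whole thing.
my $toy = '(?:\/|[^\/\s]+\/?)';

# Split a string into consecutive toy tokens, left to right.
# Returns an empty list if the whole string cannot be consumed.
sub pieces {
    my ($s) = @_;
    my @found;
    pos($s) = 0;
    # \G anchors each attempt where the previous match ended;
    # /c keeps pos() in place when a match attempt fails.
    while ($s =~ /\G($toy)/gc) {
        push @found, $1;
    }
    return pos($s) == length($s) ? @found : ();
}

foreach my $input ('x', '/*x*/', '//x') {
    my $single = $input =~ /^$toy$/  ? 'match' : 'mismatch';
    my $repeat = $input =~ /^$toy+$/ ? 'match' : 'mismatch';
    print "$input: single=$single repeated=$repeat pieces=[",
          join('][', pieces($input)), "]\n";
}
```

With this toy pattern, /*x*/ is not a single token, but it can be cut into / followed by *x*/, each of which is a token on its own, so the repeated form matches without any precedence problem. Running the same kind of \G walk with the real $token in place of $toy would show whether something similar is happening there.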
Please help! This task is maddening enough as is. Is there some size limit on regexes? If anyone suspects a bug, I am using Perl 5.005_03 built for PARISC1.1.
Edited 2001-05-23 by Ovid
In reply to really large regex misbehaving by Anonymous Monk