really large regex misbehaving

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a monk in training, and I have a regular expression that matches any text that a C++ compiler would consider a token, for the purpose of counting lines of code vs. comments. Matching a line against the regex itself works fine; matching against the regex followed by "+" (one or more tokens) breaks. The conditions for a token string for my purposes are:

EITHER:

at least one character long
no whitespace
no single or double quote marks (keeping in mind that the trigraph ??' is actually a ^)
no C or C++ comments (keeping in mind that the trigraph ??/ is actually a \)

OR:

begins and ends with either ' or " (and ending with the same one it begins with)
escaped quote marks are allowed inside the quoted string, i.e. \', \", ??/', and ??/" are OK
trigraph ??' does not close a single-quoted string

Having somewhat of a background in computing theory, I used state machines to derive "pure" regular expressions (i.e. no lookahead, stateful matching, or anything like that) and ended up with this for a token:

(((\?+|(\?\/|\/)(\?\/)*\?*)|([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*((\?+|(\/|\?\/)(\?\/)*\?*)?))|((\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))|(\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))))

And as far as I can tell, it works. I store this horrible thing in $token and match m/^$token$/ printing out "match" if it matches and "mismatch" if it doesn't.

while( $line = <> )
{
  if( $line =~ m/^$token$/ )
  {
    print "match\n";
  }
  else
  {
    print "mismatch\n";
  }
}

Sample output:

x
match
/*x*/
mismatch
"x"
match
"\"
mismatch
"\""
match
"/*x*/"
match
"x"/*x*/
mismatch
/*x*/x
mismatch
x/*x*/
mismatch
//x
mismatch
x="2";
mismatch

All of these are expected behavior. The x="2"; is actually three tokens in sequence. (I did this to get around the fact that /* and // can occur within strings.) However, if I instead match m/^$token+$/ then comments are erroneously matched.

(blank)
mismatch
x
match
x="2";
match
/*x*/
match
//x
match

I cannot figure out why this happens. The parentheses in the regular expression above are balanced, and the entire $token variable is wrapped in one outer pair of parentheses. I have considered that something like (regex1)|(regex2)+ might be happening, but that is not the case according to my parenthesis counting.

Please help! This task is maddening enough as is. Is there some size limit on regexes? If anyone suspects a bug, I am using Perl 5.005_03 built for PARISC1.1.

Edited 2001-05-23 by Ovid


Re: really large regex misbehaving - WTF by japhy (Canon) on May 22, 2001 at 22:15 UTC
Because your regex decides to match "/" as one valid match, and then "x/" as the second match.

@strings = ( 'x', '/x/', '"x"', '"\"', );
for (@strings) {
  while (/\G$REx/g) { print "$_ => '$1'\n"; }
  print "\n";
}
__END__
x => 'x'
/x/ => '/'
/x/ => 'x/'
"x" => '"x"'

See? Oh, and this is a helpful application of YAPE::Regex::Explain. Here's the output from explain. It'll explain what your regex is doing. Warning: it is very long.

Edited 2001-05-22 by Ovid
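The same effect can be reproduced on a much smaller pattern. The sketch below is only an illustration: `$simple` is a hypothetical, stripped-down stand-in for `$token` (no trigraphs, no quoted strings), and `pieces` is a made-up helper name. Like the original, it accepts a lone "/" as a token and lets a token begin with "*", which is all it takes for a comment to be split into two "valid" tokens.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for $token (NOT the regex from the post):
# no trigraphs, no quoted strings. A "/" is allowed when it is not followed
# by "/" or "*", and a lone "/" also counts as a token.
my $simple = qr{(?:[^\s'"/]|/[^\s'"/*])+/?|/};

# Split a line into successive tokens, the way ^(?:token)+$ is allowed to.
sub pieces {
    my ($line) = @_;
    return $line =~ /\G($simple)/g;
}

for my $line ('x', '/*x*/', '//x') {
    my $single = $line =~ /^(?:$simple)$/  ? 'match' : 'mismatch';
    my $plus   = $line =~ /^(?:$simple)+$/ ? 'match' : 'mismatch';
    print "$line: one token = $single, one or more = $plus, pieces = ",
          join(' ', map { "'$_'" } pieces($line)), "\n";
}
```

With this stand-in, `/*x*/` is rejected as a single token but accepted with `+`, because it splits into `/` followed by `*x*/`. Each piece is legal on its own; only the boundary between them is not.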
(Ovid) Re: really large regex misbehaving - WTF by Ovid (Cardinal) on May 22, 2001 at 22:12 UTC
Two things: don't use a regex, but if you do, try debugging it with `re 'debug'`.

Don't use a regex: you're not actually matching text here, you're trying to parse it. Use Parse::RecDescent or something similar to parse data like this. Regular expressions are powerful, but I don't think they're really suited to the task at hand.

If you must stick with a regex, reduce the text to the absolute minimum that gives you unexpected results and use `re 'debug'`. Read the output on the following to get a feel for how it works:

use strict;
use re 'debug';
my $test = "123avc 8c7d45";
$test =~ /(c[^c]+d)/;
print $1;

Cheers,
Ovid
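As one reading of "parse it rather than match it", here is a minimal hand-written scanner in core Perl. It is not Parse::RecDescent; it swaps in a plain `\G` and `/gc` loop instead. The sub name `scan` and the category labels are made up for this sketch, and it deliberately ignores trigraphs and backslash-newline continuations, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Classify the pieces of a chunk of C++ source as comment, string, or code.
# Assumption: trigraphs and line continuations are ignored in this sketch.
sub scan {
    my ($src) = @_;
    my @out;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        # /c keeps pos() where it was when an alternative fails to match.
        if    ($src =~ m{\G\s+}gc)                       { next }
        elsif ($src =~ m{\G(/\*.*?\*/)}gcs)              { push @out, [comment => $1] }
        elsif ($src =~ m{\G(//[^\n]*)}gc)                { push @out, [comment => $1] }
        elsif ($src =~ m{\G("(?:[^"\\]|\\.)*")}gcs)      { push @out, [string  => $1] }
        elsif ($src =~ m{\G('(?:[^'\\]|\\.)*')}gcs)      { push @out, [string  => $1] }
        elsif ($src =~ m{\G((?:[^\s'"/]|/(?![/*]))+)}gc) { push @out, [code    => $1] }
        else {
            # Unterminated comment or string: report the rest and stop.
            push @out, [error => substr($src, pos($src))];
            last;
        }
    }
    return @out;
}

printf "%-8s %s\n", @$_ for scan(qq{x="2"; /* note */ // tail});
```

Because each branch decides what kind of thing starts at the current position, a "/*" inside a string never gets a chance to be read as a comment, and a comment can never be split into smaller pieces.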
Re: really large regex misbehaving - WTF by japhy (Canon) on May 22, 2001 at 22:52 UTC
I have constructed a regex which is far more readable, and does the job (on some simple test cases from your post). It breaks the regex into three parts: single-quoted strings, double-quoted strings, and all others. The single- and double-quoted string parts are very similar. The logic used is:

match a quote
match as many non-quote, non-backslash, non-question-mark characters as possible
then, as many times as possible...
  match the `\\` or `??/` escape sequence and a character, OR the `??'` sequence, OR a `?` that isn't part of an escape sequence
  match as many non-quote, non-backslash, non-question-mark characters as possible
match the ending quote

If that's not possible, then we use the other part.

one or more times, match...
  as long as we aren't about to match a `//` or `/*`...
    a `??'` or `??/`, OR a `?` that's not part of an escape sequence, OR one or more non-question-marks, non-quotes, and non-whitespace
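The regex itself is not reproduced above, so the following is only a reconstruction from that description, not japhy's actual code. All variable names and the `is_tokens` helper are invented here. One deliberate deviation: the "other" branch consumes one character per iteration, so that the `//` and `/*` lookahead is checked before every slash.

```perl
use strict;
use warnings;

my $esc    = qr{(?:\\|\?\?/).}s;   # "\" or "??/" plus the escaped character
my $lone_q = qr{\?(?!\?['/])};     # a "?" that does not start ??' or ??/
# Quoted strings: opening quote, plain run, then (escape | ??' | lone ?) plus
# another plain run as often as needed, then the closing quote.
my $dq     = qr{"[^"\\?]*(?:(?:$esc|\?\?'|$lone_q)[^"\\?]*)*"};
my $sq     = qr{'[^'\\?]*(?:(?:$esc|\?\?'|$lone_q)[^'\\?]*)*'};
# Everything else: never step onto "//" or "/*" (assumed one char at a time).
my $other  = qr{(?:(?!/[/*])(?:\?\?['/]|$lone_q|[^?'"\s]))+};
my $token  = qr{$sq|$dq|$other};

# True if the whole line is one or more tokens.
sub is_tokens { return $_[0] =~ /^(?:$token)+$/ ? 1 : 0 }

for my $line ('x', 'x="2";', '"/*x*/"', '"\"', '/*x*/', '//x', 'x/*x*/') {
    print "$line => ", (is_tokens($line) ? 'match' : 'mismatch'), "\n";
}
```

The negative lookahead `(?!/[/*])` is what the original "pure" regex lacked: it refuses to start any piece at a comment opener, so `+` cannot sneak past one by cutting the input at a different boundary.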
Re: Re: really large regex misbehaving - WTF by Anonymous Monk on May 22, 2001 at 23:41 UTC
My tests agree that this works. Thank you so very much! I will have to learn these extended regex features better. Also, thanks for the tip on the explain package. These lookahead features appear to go beyond the computational power of traditional regular expressions. (At least, I can't think of a way to implement them fully using normal regexes.) I am starting to wonder whether I was literally trying to do the impossible, though I suspect there is a "pure" regex that could do the job.
Re: Re: Re: really large regex misbehaving - WTF by japhy (Canon) on May 23, 2001 at 00:50 UTC
Perl's regular expressions are not regular.

`japhy` -- Perl and Regex Hacker
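A small illustration of that point, using a backreference rather than lookahead: the set of strings made of some text repeated twice is a standard example of a non-regular language, yet a Perl pattern recognizes it directly. The helper name `is_doubled` is made up for this sketch.

```perl
use strict;
use warnings;

# True if the string is some non-empty text followed by an exact copy of itself.
sub is_doubled { return $_[0] =~ /^(.+)\1$/ ? 1 : 0 }

for my $s ('abab', 'abcabc', 'aba') {
    print "$s: ", (is_doubled($s) ? 'match' : 'mismatch'), "\n";
}
```

Here 'abab' and 'abcabc' match and 'aba' does not. No finite state machine can do this for arbitrary lengths, because it would have to remember the whole first half.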
Re: really large regex misbehaving - WTF by Anonymous Monk on May 23, 2001 at 00:34 UTC
Try /($token)+/! The + applies only to the last component of your regex.
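For reference, a minimal sketch of the precedence pitfall this reply is describing. The pattern `$alt` and the two helper subs are hypothetical; the regex in the original post already has an outer pair of parentheses, so this shows the general trap, not necessarily the cause there.

```perl
use strict;
use warnings;

# Hypothetical pattern with a top-level alternation and no outer grouping.
my $alt = 'a|b';

# Interpolates to /^a|b+$/: "starts with a" OR "ends with one or more b".
sub unwrapped { return $_[0] =~ /^$alt+$/ ? 1 : 0 }

# Interpolates to /^(?:a|b)+$/: the whole string is made of a's and b's.
sub wrapped   { return $_[0] =~ /^(?:$alt)+$/ ? 1 : 0 }

for my $s ('ax', 'ba') {
    print "$s: unwrapped = ", unwrapped($s), ", wrapped = ", wrapped($s), "\n";
}
```

Without the grouping, 'ax' is accepted (it starts with "a") and 'ba' is rejected; with the grouping it is the other way around.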