Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a monk in training, and I have a regular expression that matches any text a C++ compiler would consider a token, for the purpose of counting lines of code versus comments. Matching against the regex by itself works; matching against the regex followed by "+" breaks. The conditions for a token string for my purposes are:

EITHER:

OR:

Having somewhat of a background in computing theory, I used state machines to derive "pure" regular expressions (i.e. no lookahead, stateful matching, or anything like that) and ended up with this for a token:

(((\?+|(\?\/|\/)(\?\/)*\?*)|([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*((\?+|(\/|\?\/)(\?\/)*\?*)?))|((\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))|(\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))))

And as far as I can tell, it works. I store this horrible thing in $token and match m/^$token$/ printing out "match" if it matches and "mismatch" if it doesn't.

while( $line = <> ) {
    if( $line =~ m/^$token$/ ) {
        print "match\n";
    } else {
        print "mismatch\n";
    }
}

Sample output:

x             match
/*x*/         mismatch
"x"           match
"\"           mismatch
"\""          match
"/*x*/"       match
"x"/*x*/      mismatch
/*x*/x        mismatch
x/*x*/        mismatch
//x           mismatch
x="2";        mismatch

All of these are expected behavior. The x="2"; is actually three tokens in sequence. (I did this to get around the fact that /* and // can occur within strings.) However, if I instead match m/^$token+$/ then comments are erroneously matched.

(blank)       mismatch
x             match
x="2";        match
/*x*/         match
//x           match

I cannot figure out why this happens. The parentheses in the regular expression above are balanced, and the entire $token variable is wrapped in a single pair of parentheses. I have considered that something like (regex1)|(regex2)+ might be happening, but that is not the case according to my count of the parentheses.

Please help! This task is maddening enough as it is. Is there some size limit on regexes? If anyone suspects a bug, I am using Perl 5.005_03 built for PARISC1.1.

Edited 2001-05-23 by Ovid

Replies are listed 'Best First'.
Re: really large regex misbehaving - WTF
by japhy (Canon) on May 22, 2001 at 22:15 UTC
    Because your regex decides to match "/" as one valid match, and then "*x*/" as the second match.
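    # $REx holds the token regex from the question; its outermost parentheses fill $1.
    # The bare print "" suggests this was run with -l (or $\ = "\n"), so each print ends a line.
    @strings = ( 'x', '/*x*/', '"x"', '"\"', );
    for (@strings) {
      while (/\G$REx/g) { print "$_ => '$1'"; }
      print "";
    }
    __END__
    x => 'x'
    /*x*/ => '/'
    /*x*/ => '*x*/'
    "x" => '"x"'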
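The same split can be reproduced with a much smaller stand-in pattern. The sketch below is not the regex from the question: `$toy_token` and `split_tokens` are made-up names, and the pattern keeps only the two features that matter here, namely that a lone "/" is a valid token and that a run of other characters may end in "/".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real token regex: either a lone slash,
# or a run of non-slash, non-space characters with an optional trailing slash.
my $toy_token = qr{ / | [^/\s]+ /? }x;

# Return the pieces that repeated \G-anchored matching carves out of a string.
sub split_tokens {
    my ($s) = @_;
    my @pieces;
    while ($s =~ /\G($toy_token)/g) {
        push @pieces, $1;
    }
    return @pieces;
}

for my $s ('x', '/*x*/') {
    my $single   = $s =~ /^$toy_token$/      ? 'match' : 'mismatch';
    my $repeated = $s =~ /^(?:$toy_token)+$/ ? 'match' : 'mismatch';
    printf "%s: single=%s, repeated=%s, pieces=[%s]\n",
        $s, $single, $repeated, join('][', split_tokens($s));
}
# x: single=match, repeated=match, pieces=[x]
# /*x*/: single=mismatch, repeated=match, pieces=[/][*x*/]
```

No single token covers the comment, but two adjacent tokens do, which is all that the "+" needs.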
    See? Oh, and this is a helpful application of YAPE::Regex::Explain. Here's the output from explain. It'll explain what your regex is doing.

    Warning: it is very long.

    Edited 2001-05-22 by Ovid

(Ovid) Re: really large regex misbehaving - WTF
by Ovid (Cardinal) on May 22, 2001 at 22:12 UTC

    Two things: don't use a regex, but if you do, try debugging it with re 'debug'.

    Don't use a regex: you're not actually matching text here, you're trying to parse it. Use Parse::RecDescent or something similar to parse data like this. Regular expressions are powerful, but I don't think they're really suited to the task at hand.
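As a rough illustration of the "parse it" approach without pulling in Parse::RecDescent, here is a minimal hand-rolled scanner. It is a different technique from a grammar module: it walks the source with small \G-anchored regexes and keeps track of the current line. The name `count_lines` is made up, and the sketch assumes no trigraphs, raw string literals, or backslash line continuations.

```perl
use strict;
use warnings;

# Hypothetical helper: return (lines containing code, lines containing comments).
# Assumption: trigraphs, raw strings and line continuations are ignored.
sub count_lines {
    my ($src) = @_;
    my (@has_code, @has_comment);
    my $line = 0;
    pos($src) = 0;
    while (pos($src) < length $src) {
        if    ($src =~ /\G\n/gc)        { $line++ }                 # next line
        elsif ($src =~ /\G[^\S\n]+/gc)  { }                         # blanks and tabs
        elsif ($src =~ m{\G//[^\n]*}gc) { $has_comment[$line] = 1 } # line comment
        elsif ($src =~ m{\G(/\*.*?(?:\*/|\z))}gcs) {                # block comment
            my $text = $1;
            my $n = ($text =~ tr/\n//);   # newlines inside the comment
            $has_comment[$_] = 1 for $line .. $line + $n;
            $line += $n;
        }
        elsif ($src =~ /\G"(?:[^"\\\n]|\\.)*"?/gc) { $has_code[$line] = 1 } # string literal
        elsif ($src =~ /\G'(?:[^'\\\n]|\\.)*'?/gc) { $has_code[$line] = 1 } # char literal
        else  { $src =~ /\G./gcs; $has_code[$line] = 1 }            # any other character
    }
    my $code_lines    = grep { $_ } @has_code;
    my $comment_lines = grep { $_ } @has_comment;
    return ($code_lines, $comment_lines);
}

my $sample = <<'END';
int x = 1; // set x
/* block
   comment */
char *s = "/* not a comment */";
END

my ($code, $comments) = count_lines($sample);
print "code lines: $code, comment lines: $comments\n";
# code lines: 2, comment lines: 3
```

Because the string branch is tried before the catch-all, a /* inside a string literal is consumed as part of the string and never reaches the comment branches.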

    If you must stick with a regex, reduce the text to the absolute minimum that gives you unexpected results and use re 'debug'. Read the output on the following to get a feel for how it works:

    use strict;
    use re 'debug';
    my $test = "123avc 8c7d45";
    $test =~ /(c[^c]+d)/;
    print $1;

    Cheers,
    Ovid


Re: really large regex misbehaving - WTF
by japhy (Canon) on May 22, 2001 at 22:52 UTC
    I have constructed a regex which is far more readable, and does the job (on some simple test cases from your post).

    It breaks the regex into three parts: single-quoted strings, double-quoted strings, and all others. The single- and double-quoted string parts are very similar. The logic used is:

    • match a quote
    • match as many non-quote, non-backslash, non-question-mark characters as possible
    • then, as many times as possible...
      • match the \\ or ??/ escape sequence and a character, OR the ??' sequence, OR a ? that isn't part of an escape sequence
      • match as many non-quote, non-backslash, non-question-mark characters as possible
    • match the ending quote
    If that's not possible, then we use the other part.
    • one or more times, match...
      • as long as we aren't about to match a // or /*...
      • a ??' or ??/, OR a ? that's not part of an escape sequence, OR one or more non-question-marks, non-quotes, and non-whitespace
    This is a lengthy post, so...
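japhy's actual regex is not reproduced in this copy of the thread. What follows is only a reconstruction from the description above, so treat it as a sketch under assumptions rather than the posted code. The names `$dq`, `$sq`, `$other`, `$token` and `is_token_run` are made up. Two details are guesses: "a ? that isn't part of an escape sequence" is written as a negative lookahead, and the final character class also excludes "/" so that the comment check is repeated before every slash.

```perl
use strict;
use warnings;

# Double-quoted string literal.
my $dq = qr{
    "  [^"\\?]*                   # opening quote, then plain characters
    (?:
        (?: (?: \\ | \?\?/ ) .    # backslash, or its trigraph ??/, plus one character
          | \?\?'                 # the ??' trigraph
          | \? (?! \?[/'] )       # a question mark that starts neither trigraph
        )
        [^"\\?]*
    )*
    "                             # closing quote
}x;

# Single-quoted character literal, same shape with the other quote.
my $sq = qr{
    '  [^'\\?]*
    (?:
        (?: (?: \\ | \?\?/ ) .
          | \?\?'
          | \? (?! \?[/'] )
        )
        [^'\\?]*
    )*
    '
}x;

# Everything else, refusing to step onto the start of a comment.
my $other = qr{
    (?:
        (?! /[/*] )               # not about to match // or /*
        (?: \?\?['/] | \? | / | [^?'"\s/]+ )
    )+
}x;

my $token = qr{ $sq | $dq | $other }x;

# True if the whole string is one or more tokens with nothing left over.
sub is_token_run {
    my ($s) = @_;
    return $s =~ /\A(?:$token)+\z/ ? 1 : 0;
}

for my $s ('x', '/*x*/', '"x"', '"\"', '"\""', '"/*x*/"',
           '"x"/*x*/', '/*x*/x', 'x/*x*/', '//x', 'x="2";') {
    printf "%-10s %s\n", $s, is_token_run($s) ? 'match' : 'mismatch';
}
```

With the lookahead in place, a "/" can no longer be taken as a token when the next character would turn it into a comment opener, so the "/" plus "*x*/" split from the original regex is ruled out.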

      My tests agree that this works. Thank you so very much! I will have to learn these extended regex functions better.... Also thanks for the tip on the explain package.

      These lookahead functions appear to go beyond the computational power of traditional regular expressions. (At least, I can't think of a way to implement them fully using normal regexes.) I am starting to wonder whether I was literally trying to do the impossible, though I suspect there is a "pure" regex that could do the job.

Re: really large regex misbehaving - WTF
by Anonymous Monk on May 23, 2001 at 00:34 UTC
    Try /($token)+/! Without the parentheses, the + applies only to the last component of your regex.
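The precedence effect this reply describes can be seen with a small made-up pattern. Note that the original poster says the token regex is already wrapped in one outer pair of parentheses, so this may not be what is happening there. The names `$as_string`, `$as_qr` and `check` are hypothetical.

```perl
use strict;
use warnings;

my $as_string = 'ab|cd';     # hypothetical pattern kept as plain text
my $as_qr     = qr/ab|cd/;   # the same pattern compiled with qr//

# Return (bare, grouped, qr) match results for one input string.
sub check {
    my ($s) = @_;
    my $bare    = $s =~ /^$as_string+$/     ? 1 : 0;  # parsed as  ^ab | cd+$
    my $grouped = $s =~ /^(?:$as_string)+$/ ? 1 : 0;  # + applies to the whole group
    my $qr      = $s =~ /^$as_qr+$/         ? 1 : 0;  # qr// interpolates as a group
    return ($bare, $grouped, $qr);
}

for my $s ('abab', 'cdd', 'abXcd') {
    printf "%-6s bare=%d grouped=%d qr=%d\n", $s, check($s);
}
# abab   bare=1 grouped=1 qr=1
# cdd    bare=1 grouped=0 qr=0
# abXcd  bare=1 grouped=0 qr=0
```

A pattern stored with qr// carries its own non-capturing group when interpolated, which is one way to avoid this problem altogether.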