
I am a monk in training, and I have a regular expression that matches any text that a C++ compiler would consider a token, for the purpose of counting lines of code vs. comments. Matching a line against the regex itself works fine; matching against the regex followed by "+" (one or more tokens) breaks. For my purposes, a token string must satisfy these conditions:

EITHER:

at least one character long
no whitespace
no single or double quote marks (keeping in mind that the trigraph ??' is actually a ^)
no C or C++ comments (keeping in mind that the trigraph ??/ is actually a \)

OR:

begins and ends with either ' or " (and ending with the same one it begins with)
escaped quote marks are allowed inside the quoted string, i.e. \', \", ??/', and ??/" are OK
trigraph ??' does not close a single-quoted string
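
For anyone not used to trigraphs, here is a minimal sketch of the two replacements the conditions above rely on. The helper name expand_trigraphs is made up for illustration and is not part of my regex-only approach; it handles only ??' and ??/, not the full set of C trigraphs.

```perl
use strict;

# Hypothetical helper (an assumption for illustration): expands only the
# two trigraphs mentioned in the conditions, ??' -> ^ and ??/ -> \
sub expand_trigraphs {
    my ($text) = @_;
    my %replacement = ( "??'" => '^', '??/' => '\\' );
    $text =~ s/(\?\?['\/])/$replacement{$1}/g;
    return $text;
}

# ??' is a caret, so it cannot close a single-quoted string.
print expand_trigraphs(q{'??''}), "\n";      # prints '^'

# ??/ is a backslash, so ??/" is an escaped double quote.
print expand_trigraphs(q{"a??/"b"}), "\n";   # prints "a\"b"
```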

Having somewhat of a background in computing theory, I used state machines to derive "pure" regular expressions (i.e. no lookahead, stateful matching, or anything like that) and ended up with this for a token:

(((\?+|(\?\/|\/)(\?\/)*\?*)|([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*((\?+|(\/|\?\/)(\?\/)*\?*)?))|((\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))|(\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))))

And as far as I can tell, it works. I store this horrible thing in $token and match each input line against m/^$token$/, printing "match" if it matches and "mismatch" if it doesn't.

while( $line = <> )
{
  if( $line =~ m/^$token$/ )
  {
    print "match\n";
  }
  else
  {
    print "mismatch\n";
  }
}

Sample output:

x
match
/*x*/
mismatch
"x"
match
"\"
mismatch
"\""
match
"/*x*/"
match
"x"/*x*/
mismatch
/*x*/x
mismatch
x/*x*/
mismatch
//x
mismatch
x="2";
mismatch

All of these are expected behavior. The x="2"; is actually three tokens in sequence. (I did this to get around the fact that /* and // can occur within strings.) However, if I instead match m/^$token+$/ then comments are erroneously matched.

(blank)
mismatch
x
match
x="2";
match
/*x*/
match
//x
match
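
One way to probe this is to ask Perl which pieces of a line it accepts as consecutive tokens. The sketch below is a diagnostic, not a fix: split_tokens is a name made up for illustration, and the /x layout of the regex is only there to fit it on screen (the line breaks fall between groups, never inside a character class). It walks the line with \G and /gc and collects each piece that matches $token on its own.

```perl
use strict;

# The token regex from above, laid out with /x for readability.
my $token = qr/
(
 (
  (\?+|(\?\/|\/)(\?\/)*\?*)
  |
  ([^\'\"\/\s\?]|((\?|\/)\?+[^\s\?\"])|((\?\/|\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?]))))
  ([^\'\"\/\s\?]|\?\?+[^\?\"\s]|(\/|\?\/)(\?\/)*([^\'\"\/\s\?\*]|\?(\?+[^\?\"\s]|[^\'\"\/\s\?])))*
  ((\?+|(\/|\?\/)(\?\/)*\?*)?)
 )
 |
 (
  (\'([^\'?\\]|\\.|\?(\?+(\/.|[^?\/])))*(\'|\?\'))
  |
  (\"([^\"\\?]|\\.|\?\?+(\/.|[^\?\"\/]))*(\"|\?+\"))
 )
)
/x;

# Hypothetical helper: returns the consecutive pieces of $line that each
# match $token, starting at the beginning and stopping at the first
# position where no token matches.
sub split_tokens {
    my ($line) = @_;
    my @pieces;
    while ( $line =~ /\G($token)/gc ) {
        push @pieces, $1;
    }
    return @pieces;
}

foreach my $sample ( 'x="2";', '/*x*/', '//x' ) {
    print "$sample => [", join( '] [', split_tokens($sample) ), "]\n";
}
```

If a comment such as /*x*/ comes back as more than one piece, then each piece was accepted as a token in its own right, which would be worth checking against the per-token rules. This shows only the split Perl finds first; the engine may try other splits when backtracking inside m/^$token+$/.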

I cannot figure out why this happens. The parentheses in the regular expression above are balanced, and the entire contents of $token are enclosed in a single outer pair of parentheses. I have considered that something like (regex1)|(regex2)+ might be happening, but that is not the case according to my parenthesis counting.
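
To make the precedence worry concrete, here is a minimal sketch of what would happen if the interpolated pattern were not wrapped in its own group. The pattern 'a|b' and the sub names are made up for illustration; my real $token is already fully parenthesized, so this is the failure mode I believe I have ruled out, not a diagnosis.

```perl
use strict;

# Hypothetical stand-in for an ungrouped alternation.
my $alt = 'a|b';

# Without grouping, /^$alt+$/ is parsed as /^a|b+$/: the anchors and the
# "+" each bind to only one side of the alternation.
sub matches_ungrouped {
    my ($s) = @_;
    return ( $s =~ /^$alt+$/ ) ? 1 : 0;
}

# With an explicit group, "+" and both anchors apply to the whole pattern.
sub matches_grouped {
    my ($s) = @_;
    return ( $s =~ /^(?:$alt)+$/ ) ? 1 : 0;
}

foreach my $s ( 'abba', 'axyz' ) {
    print "$s: ungrouped=", matches_ungrouped($s),
          " grouped=", matches_grouped($s), "\n";
}
```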

Please help! This task is maddening enough as it is. Is there some size limit on regexes? In case anyone suspects a bug, I am using Perl 5.005_03 built for PARISC1.1.

Edited 2001-05-23 by Ovid

In reply to really large regex misbehaving by Anonymous Monk
