http://qs1969.pair.com?node_id=679974


in reply to Recursive Regular Expression Help

I'm not very comfortable with recursive patterns, they're new in Perl 5.10 and I doubt that PHP/PCRE support them... In that case, I'd take a 2-step approach, very much like the traditional lex/yacc approach, but simplified:
  1. tokenize
  2. parse (balance braces)

1. Tokenize

Using regular expressions, you can pull out the tokens: quoted strings, words, parens/braces/brackets, other symbols. That way you will not accidently mistake braces in quoted strings for syntactically meaningful braces. Your regex engine needs to be capable of continue matching where you left off last time, in Perl you use //g in scalar context, in Javascript you can use //g.exec(string). Likely PCRE supports something like it in PHP, but I don't actually know.

The regex can look something like this (from the top of my head, not thoroughly tested):

/\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/.*|\/(? +:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S]/g
Note that I skip whitespace except newlines, which are meaningful in Javascript, as they can terminate the current stamement. Maybe (likely) you just don't care.

Here's some (Perl) code to test it with — load the Javascript into $_ first:

while(/(\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/ +.*|\/(?:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S])/g) { unless($1 eq "\n") { print "Token: $1\n"; } else { print "Newline\n"; } }
I only display newlines differently because a bare newline as a token doesn't print so clearly.

2. Parsing – balancing braces

As you got through the tokens you extract one by one, you keep track of the nesting level: increment it if you encounter a bare "{", decrement it for a bare "}". As soon as it is decremented back to the same level as you started on for this function (usually 0, but it could be higher for nested functions), you found its end.

Here's the same code again, extended to keep track of the nesting level. As I assume the Javascript is syntactically valid, I just keep a common $level for every type of bracket, it's just simpler this way.

my $level = 0; while(/(\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/ +.*|\/(?:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S])/g) { if($1 eq "\n") { print "Newline\n"; } elsif(grep $1 eq $_, '(', '{', '[') { print "Token: $1 level $level\n"; $level++; } elsif(grep $1 eq $_, ')', '}', ']') { $level--; print "Token: $1 level $level\n"; print "Found the end of a top level block\n" if $level==0; } else { print "Token: $1\n"; } }

This should suffice to get you started.

update

  1. I changed the way multiline comments (/* ... */) are handled: now the whole comment is one token. I don't know if Javascript supports nested comments, but as it is, my code doesn't support them: It just searches for the next "*/". I found nothing on the internet about them, so I suppose, if allowed, that they are very rare.
  2. I had forgotten about regexes. Added, handled the same like quoted strings (with "/" as delimiter, backslash escapes anything) but with a possible suffix of lower case letters for the modifiers.