I'm not very comfortable with recursive patterns: they're new in Perl 5.10, and I doubt that PHP/PCRE supports them... In that case, I'd take a 2-step approach, very much like the traditional lex/yacc approach, but simplified:
- tokenize
- parse (balance braces)
1. Tokenize
Using regular expressions, you can pull out the tokens: quoted strings, words, parens/braces/brackets, other symbols. That way you will not accidentally mistake braces in quoted strings for syntactically meaningful braces. Your regex engine needs to be capable of continuing to match where it left off last time: in Perl you use //g in scalar context, in Javascript you can use //g.exec(string). PCRE likely supports something like it in PHP, but I don't actually know.
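To illustrate the "continue where you left off" behaviour in Perl, here is a minimal sketch. The helper name words_with_offsets and the sample string are made up for this example; the point is only that each //g match in scalar context starts where the previous one ended, and that pos() tells you where that is.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the words of a string one match at a time.
# In scalar context, each m//g match resumes where the previous one ended;
# pos() reports the offset just past the last match.
sub words_with_offsets {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\w+)/g) {
        push @found, [ $1, pos($text) ];
    }
    return @found;
}

for my $pair (words_with_offsets("foo(bar, baz)")) {
    my ($word, $end) = @$pair;
    print "$word ends at offset $end\n";
}
```

The same loop shape is what the tokenizer below relies on, just with a bigger pattern.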
The regex can look something like this (off the top of my head, not thoroughly tested):
/\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S]/g
Note that I skip whitespace except newlines, which are meaningful in Javascript, as they can terminate the current statement. Maybe (likely) you just don't care.
Here's some (Perl) code to test it with — load the Javascript into $_ first:
# Assumes the Javascript source is already in $_.
while (/(\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S])/g) {
    unless ($1 eq "\n") {
        print "Token: $1\n";
    } else {
        print "Newline\n";
    }
}
I only display newlines differently because a bare newline as a token doesn't print so clearly.
2. Parsing – balancing braces
As you go through the tokens you extract one by one, you keep track of the nesting level: increment it if you encounter a bare "{", decrement it for a bare "}". As soon as it is decremented back to the level you started on for this function (usually 0, but it could be higher for nested functions), you have found its end.
Here's the same code again, extended to keep track of the nesting level. As I assume the Javascript is syntactically valid, I just keep a common $level for every type of bracket; it's simpler this way.
my $level = 0;    # one shared nesting level for (), {} and []
while (/(\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S])/g) {
    if ($1 eq "\n") {
        print "Newline\n";
    } elsif (grep $1 eq $_, '(', '{', '[') {
        # opening bracket: report the level before entering it
        print "Token: $1 level $level\n";
        $level++;
    } elsif (grep $1 eq $_, ')', '}', ']') {
        # closing bracket: report the level after leaving it
        $level--;
        print "Token: $1 level $level\n";
        print "Found the end of a top level block\n" if $level == 0;
    } else {
        print "Token: $1\n";
    }
}
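If you want the end of one particular function rather than of every top level block, the same loop can remember the level at which that function's "{" was seen. What follows is only a sketch under a few assumptions: the helper name function_end is made up, it only looks for the "function NAME" form, and it counts curly braces only. It uses the token pattern with lazy quantifiers in the quoted-string and regex alternatives, so that two strings on one line are not swallowed as a single token.

```perl
use strict;
use warnings;

# Token pattern, stored in a variable for reuse.
my $TOKEN = qr/\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S]/;

# Hypothetical helper: return the offset just past the closing "}" of the
# first function named $name, or nothing if it is not found.
sub function_end {
    my ($js, $name) = @_;
    my $level = 0;          # current brace nesting level
    my $prev  = '';         # previous non-newline token
    my $state = 'search';   # search -> named -> body
    my $start_level;        # level at which the function's "{" was seen

    while ($js =~ /($TOKEN)/g) {
        my $tok = $1;
        if ($state eq 'search' && $prev eq 'function' && $tok eq $name) {
            $state = 'named';
        }
        if ($tok eq '{') {
            if ($state eq 'named') {
                $start_level = $level;
                $state = 'body';
            }
            $level++;
        } elsif ($tok eq '}') {
            $level--;
            return pos($js) if $state eq 'body' && $level == $start_level;
        }
        $prev = $tok unless $tok eq "\n";
    }
    return;
}

my $js = <<'JS';
function outer() {
    var s = "}";          // a brace inside a string
    function inner(a) {
        if (a) { return '{'; }
        return a;
    }
    return inner(1);
}
var after = 1;
JS

my $end = function_end($js, 'inner');
print substr($js, 0, $end), "\n" if defined $end;
```

Here inner starts at level 1, so its end is the "}" that brings the level back down to 1, not to 0.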
This should suffice to get you started.
update
- I changed the way multiline comments (/* ... */) are handled: now the whole comment is one token. I don't know if Javascript supports nested comments; as it is, my code doesn't support them: it just searches for the next "*/". I found nothing on the internet about them, so I suppose that, if allowed at all, they are very rare.
- I had forgotten about regex literals. Added, handled the same way as quoted strings (with "/" as delimiter, backslash escapes anything) but with a possible suffix of lower case letters for the modifiers.
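A quick way to check both points is to collect the tokens into a list and look at them. This is just a sketch: the helper name tokens and the sample inputs are made up, and the pattern uses lazy quantifiers in the quoted-string and regex alternatives so that a match stops at the first closing delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of tokens found in a piece of Javascript.
sub tokens {
    my ($js) = @_;
    my @tokens;
    while ($js =~ /(\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S])/g) {
        push @tokens, $1;
    }
    return @tokens;
}

# The comment and the regex literal each come out as a single token,
# so the braces inside them never reach the level counter.
print "[$_]\n" for tokens('/* { not a block } */ var re = /a\/b{2}/gi;');

# A "nested" comment ends at the first "*/"; the rest is tokenized normally.
print "[$_]\n" for tokens('/* a /* b */ c */');
```

Note that a "/" used for division can still be mistaken for the start of a regex literal; telling the two apart needs knowledge of the preceding token, which this tokenizer does not attempt.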