Incognito has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to come up with a regex that will grab all variable names in the following JavaScript function/file:

function test (aaaaa) {
    var myTest1 = 1;
    var myTest2 = 2, myTest3 = 3, myTest4;
    var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6;
    var myTest7 =__REGEX__;
    var myTest8 = myTest5.x;
    var myTest9 = myTest[0], myTest10 = myTest[0];
    var myTest11 = (myTest1 == myTest2);
    var myTest12 = (myTest1 == myTest2), myTest13 = 2;
    var myTest14 = (myTest1 == myTest2), myTest15;
    var myTest16 = new Array(1, 2);
    var myTest17 = new(blah), myTest18 = new(blah,blah2), myTest19;
}

The __QUOTE__ and __REGEX__ strings are basically quoted strings or regexes which were previously parsed, so we don't ever have to worry about quotes or special chars in this file.

Here's the regex (with some debug code) I developed so far (where $strInput contains the entire function and its contents):

my (@localDeclaredVars) = ($strInput =~ m/\bvar\s+([^;]+)/g);

foreach my $localDeclaredVar (@localDeclaredVars) {
    if ($localDeclaredVar =~ m/,/) {
        # We have multiple variables declared in one line.
        @localDeclaredSubVars = $localDeclaredVar =~ m{
            (                   # Grab the variable name
                \w+
            )
            \s*
            (?:                 # Suck up any possible values of that variable
                =
                \s*
                (?:             # A variable with parentheses and possible comma
                    (?:
                        \(
                        (?:
                            \\.
                            [^\)\\]*
                        )*
                        \)
                    )
                    |
                    (?:         # A straight variable or value
                        [^,]+
                    )
                )*
            )?
            ,?
        }gx;
        print "\n$localDeclaredVar\n";
        foreach my $localDeclaredSubVar (@localDeclaredSubVars) {
            if ($localDeclaredSubVar =~ m/=/) {
                ($localDeclaredSubVar) = ($localDeclaredSubVar =~ m/\s*([^= ]+)\s*=/);
            }
            print " $localDeclaredSubVar\n";
            push (@localVariables, $localDeclaredSubVar) if ($localDeclaredSubVar);
        }
    }
    else {
        # We have a single variable declaration.
        print "\n$localDeclaredVar\n";
        if ($localDeclaredVar =~ m/=/) {
            ($localDeclaredVar) = ($localDeclaredVar =~ m/\s*([^= ]+)\s*=/);
        }
        print " $localDeclaredVar\n";
        push (@localVariables, $localDeclaredVar) if ($localDeclaredVar);
    }
}

This is the output I get... As you can see, my regular expression is having a hard time sucking up brackets with commas in them. It's grabbing most variable names fine, but having trouble when there's an Array with commas in the parentheses.

Sample output

myTest1 = 1
 myTest1

myTest2 = 2, myTest3 = 3, myTest4
 myTest2
 myTest3
 myTest4

myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6
 myTest5
 __QUOTE__
 myTest6

myTest7 =__REGEX__
 myTest7

myTest8 = myTest5.x
 myTest8

myTest9 = myTest[0], myTest10 = myTest[0]
 myTest9
 myTest10

myTest11 = (myTest1 == myTest2)
 myTest11

myTest12 = (myTest1 == myTest2), myTest13 = 2
 myTest12
 myTest13

myTest14 = (myTest1 == myTest2), myTest15
 myTest14
 myTest15

myTest16 = new Array(1, 2)
 myTest16
 2

myTest17 = new(blah), myTest18 = new(blah,blah2), myTest19
 myTest17
 myTest18
 blah2
 myTest19

Can someone help me fix this regex so that it grabs all variable names, from myTest1 through to myTest19? Also, if there's a way I can do the regex without having to do the my (@localDeclaredVars) = ($strInput =~ m/\bvar\s+([^;]+)/g); that would be nice as well.
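One regex-only way around the commas-inside-brackets problem is to delete every balanced (...) and [...] group from a declaration before splitting it on commas, so that the only commas left are the ones separating declarators. The following is a minimal sketch under some assumptions, not a full JavaScript parser: declared_vars is a hypothetical helper name, quotes and regexes are assumed to be already replaced by placeholders as described above, and initialisers are assumed not to contain semicolons or object literals.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names declared by every `var`
# statement found in $src, in order of appearance.
sub declared_vars {
    my ($src) = @_;
    my @names;

    # Walk over each `var ... ;` statement in turn.
    while ($src =~ /\bvar\s+([^;]+)/g) {
        my $decl = $1;

        # Strip innermost (...) and [...] groups until none remain,
        # so commas nested inside them can no longer confuse the split.
        1 while $decl =~ s/\([^()]*\)|\[[^\[\]]*\]//g;

        # Each remaining comma-separated part starts with a variable name.
        for my $part (split /,/, $decl) {
            push @names, $1 if $part =~ /^\s*(\w+)/;
        }
    }
    return @names;
}

my $sample = 'var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6; '
           . 'var myTest16 = new Array(1, 2);';
print join(' ', declared_vars($sample)), "\n";   # prints "myTest5 myTest6 myTest16"
```

Because the outer loop matches the var statements directly on the whole input, no intermediate @localDeclaredVars array is needed. Repeating the substitution handles nested groups such as f(x, g(y, z)), which a single pass would leave partly intact.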

Re: Regex for stripping variable names from a JavaScript file
by dmmiller2k (Chaplain) on Feb 23, 2002 at 03:57 UTC

    I'm not sure I'd use a regex for this. Think about Parse::RecDescent or its ilk.

    dmm

    If you GIVE a man a fish you feed him for a day
    But,
    TEACH him to fish and you feed him for a lifetime

      Okay, so I have tried using Parse::RecDescent and built a small parser for var statements... I'm having only one problem...

      use Parse::RecDescent;

      $grammar = q {
          varStatement:       'var' statements endofvar
          statements:         <leftop: statement comma statement>
          comma_values:       <leftop: assignvalue comma assignvalue>
          statement:          var_name (operator assignvalue)(?)
          assignvalue:        equality | escapedRegex | escapedQuote
                            | array_declaration | numeric_value
                            | array_value | object_value | var_name
          array_declaration:  'new Array(' comma_values ')'
          array_value:        array_name '[' integer ']'
          equality:           '(' assignvalue equality_operator assignvalue ')'
          var_name:           /\w+/ { $return = "$item[1]" }
          array_name:         /\w+/
          object_value:       /[A-Za-z0-9_.]+/
          numeric_value:      real_number | integer
          integer:            /\d+/
          real_number:        /\d+\.?\d*/
          escapedRegex:       '__REGEX__'
          escapedQuote:       '__QUOTE__'
          operator:           '='
          equality_operator:  '===' | '==' | '!='
          endofvar:           ';'
          comma:              ','
      };

      print "\n\n";
      $parser = new Parse::RecDescent ($grammar) or die "*** Bad grammar!\n";

      foreach my $localDeclaredVar (@localDeclaredVars) {
          print "$localDeclaredVar\n";
          my $test = $parser->varStatement($localDeclaredVar) or print "*** Bad text!!!\n";
          print "==>$test\n";
      }

      How do we grab the matched var_name? My goal is to capture every variable name the grammar matches, but I've read as much of the FAQ as I could handle and still cannot work out how.

      The Input

      var myTest1 = 1;
      var myTest2 = 2, myTest3 = 3, myTest4;
      var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6;
      var myTest7 =__REGEX__;
      var myTest8 = myTest5.x;
      var myTest9 = myTest[0], myTest10 = myTest[0];
      var myTest11 = (myTest1 == myTest2);
      var myTest12 = (myTest1 == myTest2), myTest13 = 2;
      var myTest14 = (myTest1 == myTest2), myTest15;
      var myTest16 = new Array(1, 2);
      var myTest17, myTest18;

      My Output

      var myTest1 = 1;
      ==>;
      var myTest2 = 2, myTest3 = 3, myTest4;
      ==>;
      var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6;
      ==>;
      var myTest7 =__REGEX__;
      ==>;
      var myTest8 = myTest5.x;
      ==>;
      var myTest9 = myTest[0], myTest10 = myTest[0];
      ==>;
      var myTest11 = (myTest1 == myTest2);
      ==>;
      var myTest12 = (myTest1 == myTest2), myTest13 = 2;
      ==>;
      var myTest14 = (myTest1 == myTest2), myTest15;
      ==>;
      var myTest16 = new Array(1, 2);
      ==>;
      var myTest17, myTest18;
      ==>;

      As you can see, all that I get is the darn semicolon - the string that was left after all matching was successful... this is of course not what I want... Does anyone know how to solve this?

        There are two problems here. First, your grammar is not quite right. Second, you aren't setting $return in your starting rule.

        I'm not an expert with Parse::RecDescent, or with constructing grammars for YACC, Bison, etc. (far from it, actually), but IMHO you probably don't want the { $return = $item[1] } on the 'var_name:' rule.

        Instead, I think you want it on the 'statement:' AND 'varStatement:' rules (see below). Also, removing the 'comma:' rule and replacing its use with literal commas prevents getting commas in the output (apologies for not using the lingo correctly). Here's my attempt:

        use Parse::RecDescent;

        my $grammar = q {
            varStatement:       'var' statements endofvar { $return = $item[2] }
            statements:         <leftop: statement ',' statement>
            statement:          var_name (operator assignvalue)(?) { $return = $item[1] }
            comma_values:       <leftop: assignvalue ',' assignvalue>
            assignvalue:        equality | escapedRegex | escapedQuote
                              | array_declaration | numeric_value
                              | array_value | object_value | var_name
            array_declaration:  'new Array(' comma_values ')'
            array_value:        array_name '[' integer ']'
            equality:           '(' assignvalue equality_operator assignvalue ')'
            var_name:           /\w+/
            array_name:         /\w+/
            object_value:       /[A-Za-z0-9_.]+/
            numeric_value:      real_number | integer
            integer:            /\d+/
            real_number:        /\d+\.?\d*/
            escapedRegex:       '__REGEX__'
            escapedQuote:       '__QUOTE__'
            operator:           '='
            equality_operator:  '===' | '==' | '!='
            endofvar:           ';'
        };

        my @localDeclaredVars = <DATA>;
        chomp @localDeclaredVars;

        print "\n\n";
        $parser = new Parse::RecDescent ($grammar) or die "*** Bad grammar!\n";

        foreach my $localDeclaredVar (@localDeclaredVars) {
            print "$localDeclaredVar\n";
            my $test = $parser->varStatement($localDeclaredVar) or print "*** Bad text!!!\n";
            if ( ref($test) eq 'ARRAY' ) {
                print "==> ( @$test )\n";
            }
            else {
                print "==> $test\n";
            }
        }

        __END__
        var myTest1 = 1;
        var myTest2 = 2, myTest3 = 3, myTest4;
        var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6;
        var myTest7 =__REGEX__;
        var myTest8 = myTest5.x;
        var myTest9 = myTest[0], myTest10 = myTest[0];
        var myTest11 = (myTest1 == myTest2);
        var myTest12 = (myTest1 == myTest2), myTest13 = 2;
        var myTest14 = (myTest1 == myTest2), myTest15;
        var myTest16 = new Array(1, 2);
        var myTest17, myTest18;

        and here is the output:

        var myTest1 = 1;
        ==> ( myTest1 )
        var myTest2 = 2, myTest3 = 3, myTest4;
        ==> ( myTest2 myTest3 myTest4 )
        var myTest5 = new Array(__QUOTE__,__QUOTE__), myTest6;
        ==> ( myTest5 myTest6 )
        var myTest7 =__REGEX__;
        ==> ( myTest7 )
        var myTest8 = myTest5.x;
        ==> ( myTest8 )
        var myTest9 = myTest[0], myTest10 = myTest[0];
        ==> ( myTest9 myTest10 )
        var myTest11 = (myTest1 == myTest2);
        ==> ( myTest11 )
        var myTest12 = (myTest1 == myTest2), myTest13 = 2;
        ==> ( myTest12 myTest13 )
        var myTest14 = (myTest1 == myTest2), myTest15;
        ==> ( myTest14 myTest15 )
        var myTest16 = new Array(1, 2);
        ==> ( myTest16 )
        var myTest17, myTest18;
        ==> ( myTest17 myTest18 )

        This still misses array_names and variables within parenthesized expressions, but hey, it's a step in the right direction, I suppose. How would I be helping you if I solved your whole problem for you? :) At least you now have a debuggable chunk of code.

        dmm
