Incognito has asked for the wisdom of the Perl Monks concerning the following question:
I'm having a heck of a time with this problem, and I'm hoping someone out there has the solution. This hybrid regular expression (taken from Mastering Regular Expressions and from user input on this site, in this article) is used to remove all comments from a JavaScript file.
For 95% of the scripts out there, this works. The first problem came up when the file contained regex literals: we weren't stripping the comments correctly. We solved that by adding a branch to the regular expression that matches the regex literals we want to keep. The problem we are facing now is that certain regular expressions are evidently not being matched, and thus the file doesn't get stripped properly. Can someone either (a) tell me what is wrong with the regex, or (b) provide me with a regex that will successfully parse a JavaScript file? The one provided below simply doesn't cut it.
I tried rewriting this regex with better code (from japhy, to match a JavaScript regex; it works quite well on its own), but this doesn't work either for this use:

    $strOutput =~ s{
        # First, we'll list things we want
        # to match, but not throw away
        (
            (?:
                # Match RegExp
                [\(=]\s*                    # start with ( or =
                / [^\r\n\*\/][^\r\n\/]* /   # All RegExps start and end
                                            # with slash, but first one
                                            # must not be followed by *
                                            # and cannot contain newline
                                            # chars
                                            #
                                            # var re = /\*/;
                                            # a = b.match (/x/);
            )
          |                                 # -or-
            [^"'/]+                         # other stuff
          |                                 # -or-
            (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+   # double quoted string
          |                                 # -or-
            (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+   # single quoted constant
        )
      | # or we'll match a comment. Since it's not in the
        # $1 parentheses above, the comments will disappear
        # when we use $1 as the replacement text.
        /                                   # (all comments start with a slash)
        (?:
            \*[^*]*\*+(?:[^/*][^*]*\*+)*/   # traditional C comments
          |                                 # -or-
            /[^\n]*                         # C++ //-style comments
        )
    }{$1}gsx;

    $strOutput =~ s{
        # First, we'll list things we want
        # to match, but not throw away
        (
            # Match a regular expression (they start with ( or =).
            # Then they have a slash, and end with a slash.
            # The first slash must not be followed by * and cannot contain
            # newline chars. eg: var "re = /\*/;" or "a = b.match (/x/);"
            (?:
                [\(=] \s* /
                (?:
                    # char class contents
                    \[ \^? ]? (?: [^]\\]+ | \\. )* ]
                  |
                    # escaped and regular chars (\/ and \.)
                    (?: [^[\\\/]+ | \\. )*
                )*
                /[gi]*
            )
          |
            # or other stuff
            (?: [^"'/]+ )
          |
            # or double quoted string
            (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+
          |
            # or single quoted constant
            (?: '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]* )+
        )
      |
        # or we'll match a comment. Since it's not in the
        # $1 parentheses above, the comments will disappear
        # when we use $1 as the replacement text.
        /       # (all comments start with a slash)
        (?:
            # traditional C comments
            (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
          |
            # or C++ //-style comments
            (?: / [^\n]* )
        )
    }{$1}gsx;
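
Both substitutions lean on the same idea: capture whatever should stay, match comments outside the capture, and replace each match with the capture. A pared-down sketch may make that easier to follow. It is only an illustration, not a fix: the name `strip_comments` and the sample string are made up here, and the regex-literal branch is deliberately left out, so it will still trip over JavaScript regex literals that contain quotes or slashes.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of the original post.
sub strip_comments {
    my ($src) = @_;
    $src =~ s{
        (                                   # capture group: text to keep
            [^"'/]+                         # anything that cannot start a string or comment
          | "[^"\\]*(?:\\.[^"\\]*)*"        # double-quoted string
          | '[^'\\]*(?:\\.[^'\\]*)*'        # single-quoted string
        )
      |
        /                                   # every comment starts with a slash
        (?:
            \*[^*]*\*+(?:[^/*][^*]*\*+)*/   # block comment
          | /[^\n]*                         # line comment, newline is left in place
        )
    }{ defined $1 ? $1 : '' }gsex;          # comments leave the capture undefined
    return $src;
}

my $js = qq{var s = "a // b"; // trailing\n/* block */ var t = 'x';\n};
print strip_comments($js);
```

The `//` inside the double-quoted string survives because the string branch consumes it before the comment branch gets a chance. The `/e` with `defined` is an assumption added here to avoid an "uninitialized value" warning under `use warnings`; the original substitutions use a plain `$1` replacement.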

    /*====================================================================
    +=======
    ' Subroutine: None
    ' Description: None.
    '=====================================================================
    +=====*/
    function SimpleHTMLEncode (strHTMLToEncode) {
        var strOutput = strHTMLToEncode;
        if (! strOutput) { return; }
        strOutput = strOutput.replace(/"/gi, """); // aka "
        // strOutput = strOutput.replace(/&/gi, "&"); // aka &
        strOutput = strOutput.replace(/'/gi, "'"); // blah
        return (strOutput);
    }
    /*====================================================================
    +=======
    ' Subroutine: GetAddRolesArray
    '=====================================================================
    +=====*/
    function GetAddRolesArray() {
        return (BuildAddRolesObject (oRHS));
    }
    /* This is a C-style comment */
    // This is a comment.
    function HelpMe () {
        var regex = /big'fat/; // comment
        var regex = /\\/; // comment
        var reMatch = mystring.match(/asf'asfs/); // comment
        var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi); // comment
        var reSearch = mystring.search(objRegex); // comment
        var reSplit = mystring.split("\\"); // comment
    }
    /* Test1 */

    function SimpleHTMLEncode (strHTMLToEncode) {
        var strOutput = strHTMLToEncode;
        if (! strOutput) { return; }
        strOutput = strOutput.replace(/"/gi, """); // aka "
        // strOutput = strOutput.replace(/&/gi, "&"); // aka &
        strOutput = strOutput.replace(/'/gi, "'"); // blah
        return (strOutput);
    }
    /*====================================================================
    +=======
    ' Subroutine: GetAddRolesArray
    '=====================================================================
    +=====*/
    function GetAddRolesArray() {
        return (BuildAddRolesObject (oRHS));
    }
    /* This is a C-style comment */
    // This is a comment.
    function HelpMe () {
        var regex = /big'fat/; // comment
        var regex = /\\/; // comment
        var reMatch = mystring.match(/asf'asfs/); // comment
        var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi);
        var reSearch = mystring.search(objRegex);
        var reSplit = mystring.split("\\");
    + }

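
Since the japhy-derived branch is said to work quite well on its own, one way to check that claim is to compile just that branch and run it against the regex literals from the sample file. The sketch below does that; the variable names and the small report loop are made up for illustration, and each sample line keeps the leading `=` or `(` that the branch requires.

```perl
use strict;
use warnings;

# The regex-literal branch from the second substitution, compiled by itself.
my $re_literal = qr{
    [\(=] \s* /
    (?:
        \[ \^? ]? (?: [^]\\]+ | \\. )* ]    # char class contents
      |
        (?: [^[\\\/]+ | \\. )*              # escaped and regular chars
    )*
    /[gi]*
}x;

# Regex literals copied from the sample JavaScript, one per line.
# A single-quoted heredoc keeps every backslash exactly as written.
my @samples = split /\n/, <<'END';
= /big'fat/
= /\\/
(/asf'asfs/
(/[/\\*?"<>\:~|]/gi
END

for my $s (@samples) {
    my $verdict = $s =~ /\A$re_literal\z/ ? 'whole literal matched' : 'NOT fully matched';
    printf "%-24s %s\n", $s, $verdict;
}
```

A check like this only says something about the branch in isolation. Inside the combined substitution the branch is tried only at the position where a match attempt starts, so whether it ever sees the leading `(` or `=` depends on what the preceding match has already consumed.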
I hope that someone can figure this out, because I'm at the point where I'm just wasting time, trying to rewrite a regex that is nearly out of my league.
Re: Extracting C-Style Comments (Revisited)
by chipmunk (Parson) on Feb 18, 2002 at 21:54 UTC
by Incognito (Pilgrim) on Feb 18, 2002 at 23:58 UTC