I'm having a heck of a time with this problem... I'm hoping someone out there has the solution.
This hybrid regular expression (taken from Mastering Regular Expressions and from user input in
this article on this site)
is used to remove all comments from a JavaScript file.
For about 95% of the scripts out there, it works.
The first problem appeared when the file contained regex literals: the comments weren't stripped correctly.
We solved that by adding a branch to the regular expression that matches the regex literals we want to keep.
The problem we face now is that certain regex literals are still not being matched, and
thus the file doesn't get stripped properly.
Can someone either (a) tell me what is wrong with the regex, or (b) provide me with a regex that will
successfully parse a JavaScript file? The one provided below simply doesn't cut it.
$strOutput =~ s{ # First, we'll list things we want
                 # to match, but not throw away
    (
        (?:                            # Match RegExp
            [\(=]\s*                   # start with ( or =
            / [^\r\n\*\/][^\r\n\/]* /  # All RegExps start and end
                                       # with slash, but first one
                                       # must not be followed by *
                                       # and cannot contain newline
                                       # chars
                                       #
                                       # var re = /\*/;
                                       # a = b.match (/x/);
        )
      |                                # -or-
        [^"'/]+                        # other stuff
      |                                # -or-
        (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+ # double quoted string
      |                                # -or-
        (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+ # single quoted constant
    )
  |
    # or we'll match a comment. Since it's not in the
    # $1 parentheses above, the comments will disappear
    # when we use $1 as the replacement text.
    /                                  # (all comments start with a slash)
    (?:
        \*[^*]*\*+(?:[^/*][^*]*\*+)*/  # traditional C comments
      |                                # -or-
        /[^\n]*                        # C++ //-style comments
    )
}{$1}gsx;
I tried rewriting this regex with better code from
Japhy for matching a JavaScript regex. That code works quite
well on its own, but it doesn't work for this use either:
$strOutput =~ s{ # First, we'll list things we want
                 # to match, but not throw away
    (
        # Match a regular expression (they start with ( or =).
        # Then they have a slash, and end with a slash.
        # The first slash must not be followed by * and cannot contain
        # newline chars. e.g. "var re = /\*/;" or "a = b.match (/x/);"
        (?:
            [\(=] \s*
            /
            (?:
                # char class contents
                \[ \^? ]? (?: [^]\\]+ | \\. )* ]
              |
                # escaped and regular chars (\/ and \.)
                (?: [^[\\\/]+ | \\. )*
            )*
            /[gi]*
        )
      | # or other stuff
        (?:
            [^"'/]+
        )
      | # or double quoted string
        (?:
            "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]*
        )+
      | # or single quoted constant
        (?:
            '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]*
        )+
    )
  |
    # or we'll match a comment. Since it's not in the
    # $1 parentheses above, the comments will disappear
    # when we use $1 as the replacement text.
    / # (all comments start with a slash)
    (?:
        # traditional C comments
        (?:
            \* [^*]* \*+
            (?: [^/*] [^*]* \*+ )*
            /
        )
      | # or C++ //-style comments
        (?:
            / [^\n]*
        )
    )
}{$1}gsx;
Sample Code
In this sample file, we have a variety of regular expressions and comments.
Neither regex I have written handles it correctly. The first function header
is stripped, but the second one isn't, and most // comments at the end
of a line containing a regex aren't stripped either.
/*===========================================================================
' Subroutine: None
' Description: None.
'==========================================================================*/
function SimpleHTMLEncode (strHTMLToEncode) {
var strOutput = strHTMLToEncode;
if (! strOutput) {
return;
}
strOutput = strOutput.replace(/"/gi, "&quot;"); // aka "
// strOutput = strOutput.replace(/&/gi, "&amp;"); // aka &
strOutput = strOutput.replace(/'/gi, "'"); // blah
return (strOutput);
}
/*===========================================================================
' Subroutine: GetAddRolesArray
'==========================================================================*/
function GetAddRolesArray() {
return (BuildAddRolesObject (oRHS));
}
/*
This is a C-style comment
*/
// This is a comment.
function HelpMe () {
var regex = /big'fat/; // comment
var regex = /\\/; // comment
var reMatch = mystring.match(/asf'asfs/); // comment
var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi); // comment
var reSearch = mystring.search(objRegex); // comment
var reSplit = mystring.split("\\"); // comment
}
/*
Test1
*/
Parse File Output
This is the typical output I get. Notice that some comments were stripped and others were left in place.
function SimpleHTMLEncode (strHTMLToEncode) {
var strOutput = strHTMLToEncode;
if (! strOutput) {
return;
}
strOutput = strOutput.replace(/"/gi, "&quot;"); // aka "
// strOutput = strOutput.replace(/&/gi, "&amp;"); // aka &
strOutput = strOutput.replace(/'/gi, "'"); // blah
return (strOutput);
}
/*===========================================================================
' Subroutine: GetAddRolesArray
'==========================================================================*/
function GetAddRolesArray() {
return (BuildAddRolesObject (oRHS));
}
/*
This is a C-style comment
*/
// This is a comment.
function HelpMe () {
var regex = /big'fat/; // comment
var regex = /\\/; // comment
var reMatch = mystring.match(/asf'asfs/); // comment
var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi);
var reSearch = mystring.search(objRegex);
var reSplit = mystring.split("\\");
}
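For what it's worth, here is a smaller reproduction, offered as a sketch rather than a diagnosis. It runs the first regex from this post (comments trimmed, and the replacement guarded with `defined` so an unset `$1` doesn't warn) over a hypothetical two-line input that is not part of the sample file. The sub name `strip_comments` and the input string are my own inventions for illustration. If I trace it correctly, the "other stuff" branch consumes the `=` before the RegExp branch is ever tried, so the two apostrophes get paired up as one single-quoted string.

```perl
use strict;
use warnings;

# Applies the first regex from this post. The only change is the guarded
# replacement, so an unset capture group yields '' without a warning.
sub strip_comments {
    my ($strOutput) = @_;
    $strOutput =~ s{
        (
            (?: [\(=]\s* / [^\r\n\*\/][^\r\n\/]* / )    # RegExp branch
          | [^"'/]+                                      # other stuff
          | (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+        # double quoted string
          | (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+        # single quoted constant
        )
      |
        /
        (?:
            \*[^*]*\*+(?:[^/*][^*]*\*+)*/                # C comments
          | /[^\n]*                                      # //-style comments
        )
    }{ defined $1 ? $1 : '' }gsxe;
    return $strOutput;
}

# Hypothetical two-line input, not taken from the sample file.
my $js = "var a = /x'y/; // one\nvar b = /p'q/; // two\n";
print strip_comments($js);
```

On this input I would expect `// one` to survive and `// two` to be removed, which resembles the mix of kept and stripped comments in the output above.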
I hope that someone can figure this out, because I'm at the point where I'm just wasting
time, trying to rewrite a regex that is nearly out of my league.