The following regex code, which removes comments from a chunk of JavaScript,
was developed with the help of several awesome Perl monks at this site.
The Code
#----------------------------------------------------------------------
# Here is the fundamental code to match JavaScript code.
# This includes regular expressions and quoted strings.
#----------------------------------------------------------------------
my ($regexJSCode) = qr{
    # First, we'll list things we want
    # to match, but not throw away
    (?:
        # Match a regular expression literal (it must follow a ( or =).
        # It starts with a slash and ends with a slash.
        # The first slash must not be followed by * and the literal cannot
        # contain newline chars. eg: "var re = /\*/;" or "a = b.match (/x/);"
        [\(=] \s*
        /
        (?:
            # char class contents
            \[ \^? ]? (?: [^]\\]+ | \\. )* ]
        |
            # escaped and regular chars (\/ and \.)
            (?: [^[\\\/]+ | \\. )*
        )*
        /
        (?:
            [gi]*
            # next character is not a word character
            (?= [^\w] )
        )
    )
    |   # or double quoted string
    (?:
        "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]*
    )+
    |   # or single quoted constant
    (?:
        '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]*
    )+
}x;
#----------------------------------------------------------------------
# Here is the fundamental code to match JavaScript comments and comment blocks.
#----------------------------------------------------------------------
my ($regexJSComments) = qr{
    # Match a comment. Since it is outside the capturing
    # parentheses used in the substitution, the comment disappears
    # when the captured text is used as the replacement.
    /   # (all comments start with a slash)
    (?:
        # traditional C comments
        (?:
            \* [^*]* \*+
            (?: [^/*] [^*]* \*+ )*
            /
        )
    |   # or C++ //-style comments
        (?:
            / [^\n]*
        )
    )
}x;
#----------------------------------------------------------------------
# Get rid of all comments from the string.
#----------------------------------------------------------------------
$strOutput =~ s{
( $regexJSCode ) | $regexJSComments
}{$1}gsx;
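To make the keep-or-drop idea behind that substitution easier to see, here is a minimal sketch with two much simpler stand-in patterns. These are assumptions for illustration, not the patterns above: "code to keep" is just a double-quoted string and "comment" is only a //-style comment. The helper name strip_line_comments is hypothetical.

```perl
use strict;
use warnings;

# Simplified stand-ins (assumptions for illustration only).
my $keep    = qr{ " [^"\\]* (?: \\. [^"\\]* )* " }x;   # double-quoted string
my $comment = qr{ // [^\n]* }x;                        # //-style comment

sub strip_line_comments {
    my ($text) = @_;
    # If the $keep branch matches, $1 holds the text and is put back.
    # If the $comment branch matches, $1 is undefined, so the
    # replacement is empty (// '' avoids an "uninitialized" warning).
    $text =~ s{ ( $keep ) | $comment }{ $1 // '' }gex;
    return $text;
}

# The // inside the string survives; the trailing comment is removed.
print strip_line_comments(qq{var url = "http://example.com"; // trailing note\n});
```

The point of matching strings at all is that the engine consumes them whole, so a // inside a string is never seen as the start of a comment. The defined-or operator needs Perl 5.10 or later.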
Input (Problems)
function test (str) {
    // A comment.
    alert ("test");
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}
Parsed Output (Incorrect)
The choking occurs on the reForwardSlash assignment, where the regex literal is cut off after the backslash:
function test (str) {
    alert ("test");
    var reForwardSlash = /\
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}
If we remove the alert ("test"); line, the same function parses correctly,
so the regex appears to mishandle a regex literal that comes after a quoted string.
Here's a successful parse with the same regex, just different input.
Input (No Problems)
function test (str) {
    // A comment.
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}
Parsed Output (Correct)
This is the expected parse output.
function test (str) {
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}
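One way to compare the two runs is to list every match together with the branch that produced it. The sketch below is a rough diagnostic, not the original code: the three patterns are condensed copies of the branches in the post (the character-class handling and the single-quoted branch are left out for brevity), and the names trace_matches, $re_literal, $dq_string and $comment are hypothetical.

```perl
use strict;
use warnings;

# Condensed stand-ins for the branches in the post (assumption: simplified).
my $re_literal = qr{ [\(=] \s* / (?: [^\\/\n]+ | \\. )* / [gi]* (?= \W ) }x;
my $dq_string  = qr{ (?: " [^"\\]* (?: \\. [^"\\]* )* " [^"'/]* )+ }x;
my $comment    = qr{ / (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / | / [^\n]* ) }x;

# Return a list of [ 'kept' | 'dropped', matched text ] pairs, in order.
sub trace_matches {
    my ($js) = @_;
    my @trace;
    while ($js =~ m{ ( $re_literal | $dq_string ) | ( $comment ) }gx) {
        push @trace, defined $1 ? [ kept => $1 ] : [ dropped => $2 ];
    }
    return @trace;
}

my %inputs = (
    'with alert'    => qq{alert ("test");\nvar reForwardSlash = /\\//;\n},
    'without alert' => qq{var reForwardSlash = /\\//;\n},
);

for my $name (sort keys %inputs) {
    print "$name:\n";
    for my $m (trace_matches($inputs{$name})) {
        (my $shown = $m->[1]) =~ s/\n/\\n/g;   # keep each match on one line
        print "  $m->[0]: [$shown]\n";
    }
}
```

With these condensed patterns, the input without the alert line reports the regex literal (starting at the =) as kept. With the alert line, the string branch is reported as keeping everything from "test" up to and including the = and the space after it, because of the trailing [^"'/]* in that branch. The next match then starts at the slash, where the regex-literal branch can no longer see a ( or =, and the //; tail is reported as a dropped comment. That matches the truncated output shown earlier, which suggests the trailing [^"'/]* is the place to look.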
Help Wanted
So as you can see, I'm doing something wrong in the regex. Does anyone see what the problem is?
Any help is greatly appreciated.