PerlMonks  

Extracting C-Style Comments (Revisited)

by Incognito (Pilgrim)
on Feb 18, 2002 at 21:06 UTC (id://146247, perlquestion)

Incognito has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a heck of a time with this problem, and I'm hoping someone out there has the solution. This hybrid regular expression (adapted from Mastering Regular Expressions and from user input in an earlier article on this site) is used to remove all comments from a JavaScript file.

For 95% of the scripts out there, this works. The first problem appeared when a file contained regex literals: the comments weren't stripped correctly. We solved that by adding a branch to the regular expression that matches JavaScript regex literals, so that they are kept intact. The problem we face now is that certain regex literals are evidently not being matched, and so the file doesn't get stripped properly. Can someone either (a) tell me what is wrong with the regex, or (b) provide a regex that will successfully parse a JavaScript file? The one provided below simply doesn't cut it.

    $strOutput =~ s{
        # First, we'll list things we want
        # to match, but not throw away
        (
            (?:
                # Match RegExp
                [\(=]\s*                    # start with ( or =
                / [^\r\n\*\/][^\r\n\/]* /   # All RegExps start and end
                                            # with slash, but first one
                                            # must not be followed by *
                                            # and cannot contain newline
                                            # chars
                                            #
                                            # var re = /\*/;
                                            # a = b.match (/x/);
            )
          |                                 # -or-
            [^"'/]+                         # other stuff
          |                                 # -or-
            (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+   # double quoted string
          |                                 # -or-
            (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+   # single quoted constant
        )
      |
        # or we'll match a comment. Since it's not in the
        # $1 parentheses above, the comments will disappear
        # when we use $1 as the replacement text.
        /                                   # (all comments start with a slash)
        (?:
            \*[^*]*\*+(?:[^/*][^*]*\*+)*/   # traditional C comments
          |                                 # -or-
            /[^\n]*                         # C++ //-style comments
        )
    }{$1}gsx;

I tried rewriting this regex with better code from Japhy for matching a JavaScript regex (which works quite well on its own), but it doesn't work for this use either:

    $strOutput =~ s{
        # First, we'll list things we want
        # to match, but not throw away
        (
            # Match a regular expression (they start with ( or =).
            # Then they have a slash, and end with a slash.
            # The first slash must not be followed by * and cannot contain
            # newline chars. eg: var "re = /\*/;" or "a = b.match (/x/);"
            (?:
                [\(=] \s*
                /
                (?:
                    # char class contents
                    \[ \^? ]? (?: [^]\\]+ | \\. )* ]
                  |
                    # escaped and regular chars (\/ and \.)
                    (?: [^[\\\/]+ | \\. )*
                )*
                /[gi]*
            )
          |
            # or other stuff
            (?: [^"'/]+ )
          |
            # or double quoted string
            (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+
          |
            # or single quoted constant
            (?: '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]* )+
        )
      |
        # or we'll match a comment. Since it's not in the
        # $1 parentheses above, the comments will disappear
        # when we use $1 as the replacement text.
        /       # (all comments start with a slash)
        (?:
            # traditional C comments
            (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
          |
            # or C++ //-style comments
            (?: / [^\n]* )
        )
    }{$1}gsx;

Sample Code

In this sample file, we have a variety of regular expressions and comments. Neither regex I have written handles it correctly. The first function header gets stripped, but the second one doesn't, and most // comments at the end of a line containing a regex don't get stripped either.

    /*===========================================================================
    ' Subroutine: None
    ' Description: None.
    '==========================================================================*/
    function SimpleHTMLEncode (strHTMLToEncode) {
        var strOutput = strHTMLToEncode;
        if (! strOutput) { return; }
        strOutput = strOutput.replace(/"/gi, "&#34;"); // aka &quot;
        // strOutput = strOutput.replace(/&/gi, "&#38;"); // aka &amp;
        strOutput = strOutput.replace(/'/gi, "&#39;"); // blah
        return (strOutput);
    }
    /*===========================================================================
    ' Subroutine: GetAddRolesArray
    '==========================================================================*/
    function GetAddRolesArray() {
        return (BuildAddRolesObject (oRHS));
    }
    /* This is a C-style comment */
    // This is a comment.
    function HelpMe () {
        var regex = /big'fat/; // comment
        var regex = /\\/; // comment
        var reMatch = mystring.match(/asf'asfs/); // comment
        var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi); // comment
        var reSearch = mystring.search(objRegex); // comment
        var reSplit = mystring.split("\\"); // comment
    }
    /* Test1 */

Parse File Output

This is the typical output I get. Notice that some comments were stripped out, and others were left.

    function SimpleHTMLEncode (strHTMLToEncode) {
        var strOutput = strHTMLToEncode;
        if (! strOutput) { return; }
        strOutput = strOutput.replace(/"/gi, "&#34;"); // aka &quot;
        // strOutput = strOutput.replace(/&/gi, "&#38;"); // aka &amp;
        strOutput = strOutput.replace(/'/gi, "&#39;"); // blah
        return (strOutput);
    }
    /*===========================================================================
    ' Subroutine: GetAddRolesArray
    '==========================================================================*/
    function GetAddRolesArray() {
        return (BuildAddRolesObject (oRHS));
    }
    /* This is a C-style comment */
    // This is a comment.
    function HelpMe () {
        var regex = /big'fat/; // comment
        var regex = /\\/; // comment
        var reMatch = mystring.match(/asf'asfs/); // comment
        var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi);
        var reSearch = mystring.search(objRegex);
        var reSplit = mystring.split("\\");
    }

I hope that someone can figure this out, because I'm at the point where I'm just wasting time, trying to rewrite a regex that is nearly out of my league.

Replies are listed 'Best First'.
Re: Extracting C-Style Comments (Revisited)
by chipmunk (Parson) on Feb 18, 2002 at 21:54 UTC
    There are two things that are tripping you up. The first is the greediness of [^"'/]+. The part of the regex that matches a regular expression looks for an equal sign or left paren followed by a slash; unfortunately, the equal sign or left paren has already been gobbled up! You could fix this by adding = and ( to the character class. On the other hand, since you're substituting in place, you don't even need that part of the regex. Just remove [^"'/]+ | and it should work fine.
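The greediness problem can be reproduced on a single line. The following is only a sketch: the four `qr` pieces are simplified stand-ins for the branches of the posted regex, and the names `strip_greedy` and `strip_fixed` are made up for illustration. With the "other stuff" branch present, `a = ` is consumed before the regex-literal branch gets a chance, so the apostrophe inside `/x'y/` is later taken as the start of a single-quoted string that runs into the comment.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Simplified stand-ins for the branches of the thread's regex (illustrative only).
my $literal = qr{ [(=] \s* / [^\r\n*/] [^\r\n/]* / }x;   # "(" or "=", then /.../
my $dquote  = qr{ " [^"\\]* (?: \\. [^"\\]* )* " }x;     # double-quoted string
my $squote  = qr{ ' [^'\\]* (?: \\. [^'\\]* )* ' }x;     # single-quoted string
my $comment = qr{ / (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / | / [^\n]* ) }x;

# With the greedy "other stuff" branch, as in the original post.
sub strip_greedy {
    my ($js) = @_;
    $js =~ s{ ( $literal | [^"'/]+ | $dquote | $squote ) | $comment }{ $1 // '' }gex;
    return $js;
}

# Without that branch, as suggested in the reply above.
sub strip_fixed {
    my ($js) = @_;
    $js =~ s{ ( $literal | $dquote | $squote ) | $comment }{ $1 // '' }gex;
    return $js;
}

my $input = q{a = /x'y/; // it's here};
print "greedy: [", strip_greedy($input), "]\n";   # greedy: [a = /x'y/; // it's here]
print "fixed:  [", strip_fixed($input),  "]\n";   # fixed:  [a = /x'y/; ]
```

Dropping the branch works because the substitution is unanchored: characters that no branch matches are simply skipped over and left in place.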

    The other problem is this curious regex in the JS: mystring.match(/[/\\*?"<>\:~|]/gi);. That regex would not be valid in Perl, because it contains an unescaped forward slash. Is it really valid in JavaScript? If so, you'll need to extend your regex so that it allows unescaped slashes within square brackets.
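One possible way to allow an unescaped slash inside square brackets is to treat a bracketed character class as a single unit when matching the body of the literal. The pattern below is a sketch under that assumption, not a full JavaScript grammar; the name `$js_regex` and the flag set `[gim]` are chosen for illustration, and it assumes the JavaScript engine really does accept the literal from the sample file.

```perl
use strict;
use warnings;

# Assumed shape of a JavaScript regex literal (a sketch, not a full grammar).
my $js_regex = qr{
    /
    (?! [*/] )                              # not the start of a comment
    (?:
        \\ .                                # an escaped character
      | \[ (?: \\ . | [^\]\\\r\n] )* \]     # character class; a bare / is allowed here
      | [^/\\\[\r\n]                        # any other single character
    )+
    /
    [gim]*                                  # trailing flags
}x;

# A single-quoted heredoc keeps every backslash exactly as written.
chomp(my $line = <<'END_JS');
var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi); // comment
END_JS

if ($line =~ /($js_regex)/) {
    print "regex literal: $1\n";   # regex literal: /[/\\*?"<>\:~|]/gi
}
```

Because the class alternative consumes everything up to the closing bracket, the slash inside it is never mistaken for the end of the literal.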

      Yes, the greediness of [^"'/]+ was definitely the problem. The new regular expression to strip C-style comments from a JavaScript file is:

          $strOutput =~ s{
              # First, we'll list things we want
              # to match, but not throw away
              (
                  # Match a regular expression (they start with ( or =).
                  # Then they have a slash, and end with a slash.
                  # The first slash must not be followed by * and cannot contain
                  # newline chars. eg: var "re = /\*/;" or "a = b.match (/x/);"
                  (?:
                      [\(=] \s*
                      /
                      (?:
                          # char class contents
                          \[ \^? ]? (?: [^]\\]+ | \\. )* ]
                        |
                          # escaped and regular chars (\/ and \.)
                          (?: [^[\\\/]+ | \\. )*
                      )*
                      /[gi]*
                  )
                |
                  # or double quoted string
                  (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+
                |
                  # or single quoted constant
                  (?: '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]* )+
              )
            |
              # or we'll match a comment. Since it's not in the
              # $1 parentheses above, the comments will disappear
              # when we use $1 as the replacement text.
              /       # (all comments start with a slash)
              (?:
                  # traditional C comments
                  (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
                |
                  # or C++ //-style comments
                  (?: / [^\n]* )
              )
          }{$1}gsx;

      I'll do some further testing, but it looks like this huge regex will do the trick! Thanks and ++ to you chipmunk.
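For the further testing mentioned above, a small regression script is one option. This is a sketch only: `strip_comments` is a hypothetical wrapper around the final substitution, the replacement uses `$1 // ''` with the `/e` flag purely to avoid "uninitialized" warnings, and the cases are lines taken from or modelled on the sample file. It uses the core Test::More module.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator
use Test::More;

# Hypothetical wrapper around the final substitution from this thread.
sub strip_comments {
    my ($js) = @_;
    $js =~ s{
        (
            # regex literal directly after "(" or "="
            (?: [\(=] \s* /
                (?:
                    \[ \^? ]? (?: [^]\\]+ | \\. )* ]
                  |
                    (?: [^[\\\/]+ | \\. )*
                )*
                /[gi]*
            )
          |
            (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+    # double-quoted string
          |
            (?: '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]* )+    # single-quoted string
        )
      |
        /
        (?:
            (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / ) # C comment
          |
            (?: / [^\n]* )                              # line comment
        )
    }{ $1 // '' }gsex;
    return $js;
}

# The character-class case comes from a single-quoted heredoc so that
# its backslashes stay exactly as written in the sample file.
chomp(my $tricky = <<'END_JS');
var reMatch = mystring.match(/[/\\*?"<>\:~|]/gi); // comment
END_JS
(my $tricky_expected = $tricky) =~ s{// comment\z}{};

my @cases = (
    [ q{var regex = /big'fat/; // comment}, q{var regex = /big'fat/; }, 'quote inside regex' ],
    [ q{/* Test1 */},                       q{},                        'lone C comment'     ],
    [ q{var s = "a // b"; // comment},      q{var s = "a // b"; },      'slashes in string'  ],
    [ $tricky,                              $tricky_expected,           'slash in class'     ],
);

is( strip_comments($_->[0]), $_->[1], $_->[2] ) for @cases;
done_testing();
```

Each case pairs an input line with the text expected after stripping; new failures found in real files can be appended to `@cases` as they turn up.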
