PerlMonks  

Extracting C-Style Comments (Revisited Again)

by Incognito (Pilgrim)
on Mar 06, 2002 at 00:56 UTC ( [id://149574]=perlquestion )

Incognito has asked for the wisdom of the Perl Monks concerning the following question:

The following regex code, which removes comments from a chunk of JavaScript, was developed with the help of several awesome Perl monks at this site...

The Code

#----------------------------------------------------------------------
# Here is the fundamental code to match JavaScript code.
# This includes regular expressions and quoted strings.
#----------------------------------------------------------------------
my ($regexJSCode) = qr{
    # First, we'll list things we want
    # to match, but not throw away
    (?:
        # Match a regular expression (they start with ( or =).
        # Then they have a slash, and end with a slash.
        # The first slash must not be followed by * and cannot contain
        # newline chars. eg: var "re = /\*/;" or "a = b.match (/x/);"
        [\(=] \s* /
        (?:
            # char class contents
            \[ \^? ]? (?: [^]\\]+ | \\. )* ]
            |
            # escaped and regular chars (\/ and \.)
            (?: [^[\\\/]+ | \\. )*
        )*
        /
        (?:
            [gi]*
            # next characters are not word characters
            (?= [^\w] )
        )
    )
    |
    # or double quoted string
    (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+
    |
    # or single quoted constant
    (?: '[^'\\]* (?:\\.[^'\\]*)*' [^"'/]* )+
}x;

#----------------------------------------------------------------------
# Here is the fundamental code to match JavaScript comments and comment blocks.
#----------------------------------------------------------------------
my ($regexJSComments) = qr{
    # or we'll match a comment. Since it's not in the
    # $1 parentheses above, the comments will disappear
    # when we use $1 as the replacement text.
    /   # (all comments start with a slash)
    (?:
        # traditional C comments
        (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
        |
        # or C++ //-style comments
        (?: / [^\n]* )
    )
}x;

#----------------------------------------------------------------------
# Get rid of all comments from the string.
#----------------------------------------------------------------------
$strOutput =~ s{ ( $regexJSCode ) | $regexJSComments }{$1}gsx;

Input (Problems)

function test (str) {
    // A comment.
    alert ("test");
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}

Parsed Output (Incorrect)

The choking occurs on the reForwardSlash variable:

function test (str) {
    alert ("test");
    var reForwardSlash = /\
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}

If we get rid of the alert ("test") line, we get the proper parsing, so the regex we have developed has some issues. Here's a successful parse with the same regex, just different input.

Input (No Problems)

function test (str) {
    // A comment.
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}

Parsed Output (Correct)

This is the expected parse output.

function test (str) {
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}

Help Wanted

So as you can see, I'm doing something wrong in the regex... Does anyone see what the problem is? Any help is greatly appreciated.

Replies are listed 'Best First'.
Re: Extracting C-Style Comments (Revisited Again)
by chipmunk (Parson) on Mar 06, 2002 at 02:42 UTC
    It took longer than I'd like to admit to figure out the problem this time. :)
    |
    # or double quoted string
    (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )
    This matches a double-quoted string, then some amount of code after the double-quoted string. [^"'/]* will match everything up to the next quote or slash, including the open parenthesis or equal sign that you are relying on to match as the beginning of the JS regular expression. Simply remove that bit from your regex (after the single-quoted string match as well) and the JS code snippet will be parsed properly.
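To see the effect in isolation, here is a minimal sketch that runs only the double-quoted-string branch, with and without the trailing [^"'/]*, against one line modelled on the failing input. The names $js, $greedy and $strict are made up for this illustration and are not part of the posted code.

```perl
use strict;
use warnings;

# Hypothetical one-line input modelled on the failing sample.
my $js = 'alert ("test"); var reForwardSlash = /\//;';

# Double-quoted-string branch as originally posted: the trailing
# [^"'/]* also consumes code that follows the closing quote.
my $greedy = qr{ (?: "[^"\\]* (?:\\.[^"\\]*)*" [^"'/]* )+ }x;

# The same branch with the trailing character class removed.
my $strict = qr{ (?: "[^"\\]* (?:\\.[^"\\]*)*" )+ }x;

my ($with_tail)    = $js =~ /($greedy)/;
my ($without_tail) = $js =~ /($strict)/;

print "with tail:    [$with_tail]\n";     # ["test"); var reForwardSlash = ]
print "without tail: [$without_tail]\n";  # ["test"]
```

With the tail in place, the capture runs right up to the slash that opens the regex literal, so the = has already been consumed and the regex-literal branch has nothing to anchor on. Without it, the match stops at the closing quote and the = is still available to the next match attempt.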

      Excellent! Another ++ to you!!! I actually understood your answer for once, which is great... short and to the point... Here's the fully updated regex code for those that are interested...

      #----------------------------------------------------------------------
      # Here is the fundamental code to match JavaScript code.
      # This includes regular expressions and quoted strings.
      #----------------------------------------------------------------------
      my ($regexJSCode) = qr{
          # First, we'll list things we want
          # to match, but not throw away
          (?:
              # Match a regular expression (they start with ( or =).
              # Then they have a slash, and end with a slash.
              # The first slash must not be followed by * and cannot contain
              # newline chars. eg: var "re = /\*/;" or "a = b.match (/x/);"
              [\(=] \s* /
              (?:
                  # char class contents
                  \[ \^? ]? (?: [^]\\]+ | \\. )* ]
                  |
                  # escaped and regular chars (\/ and \.)
                  (?: [^[\\\/]+ | \\. )*
              )*
              /
              (?:
                  [gi]*
                  # next characters are not word characters
                  (?= [^\w] )
              )
          )
          |
          # or double quoted string
          (?: "[^"\\]* (?:\\.[^"\\]*)*" )+
          |
          # or single quoted constant
          (?: '[^'\\]* (?:\\.[^'\\]*)*' )+
      }x;

      #----------------------------------------------------------------------
      # Here is the fundamental code to match JavaScript comments and comment blocks.
      #----------------------------------------------------------------------
      my ($regexJSComments) = qr{
          # or we'll match a comment. Since it's not in the
          # $1 parentheses above, the comments will disappear
          # when we use $1 as the replacement text.
          /   # (all comments start with a slash)
          (?:
              # traditional C comments
              (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
              |
              # or C++ //-style comments
              (?: / [^\n]* )
          )
      }x;

      #----------------------------------------------------------------------
      # Get rid of all comments from the string.
      #----------------------------------------------------------------------
      $strOutput =~ s{ ( $regexJSCode ) | $regexJSComments }{$1}gsx;
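One way to check the updated code is to run it against the input that failed earlier. The following is a minimal sketch, not the posted code verbatim: the in-pattern comments are trimmed for brevity, the heredoc input is copied from the problem case, and the replacement uses /e with a defined test (an assumption added here) so that a matched comment does not trigger an "uninitialized value" warning under use warnings.

```perl
use strict;
use warnings;

# Code we want to keep: regex literals and quoted strings.
my $regexJSCode = qr{
    (?:
        [\(=] \s* /
        (?:
            \[ \^? ]? (?: [^]\\]+ | \\. )* ]
            |
            (?: [^[\\\/]+ | \\. )*
        )*
        /
        (?: [gi]* (?= [^\w] ) )
    )
    |
    (?: "[^"\\]* (?:\\.[^"\\]*)*" )+
    |
    (?: '[^'\\]* (?:\\.[^'\\]*)*' )+
}x;

# Comments we want to drop: block comments and line comments.
my $regexJSComments = qr{
    /
    (?:
        (?: \* [^*]* \*+ (?: [^/*] [^*]* \*+ )* / )
        |
        (?: / [^\n]* )
    )
}x;

# The input that was mangled before the fix.
my $strOutput = <<'END_JS';
function test (str) {
    // A comment.
    alert ("test");
    var reForwardSlash = /\//;
    var reBackslash = /\\/;
    if (str.match(regexForwardslash) && str.match(regexBackslash)) {
        return true;
    }
}
END_JS

# Keep $1 when a code branch matched; a matched comment leaves $1
# undefined, so substitute the empty string in that case.
$strOutput =~ s{ ( $regexJSCode ) | $regexJSComments }{ defined $1 ? $1 : '' }gsxe;

print $strOutput;
```

The comment line should come out empty (only its indentation remains), while both regex literals and the alert ("test") call pass through unchanged.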
