C comment stripping preprocessor

by GrandFather (Saint)
on Aug 09, 2006 at 18:20 UTC (CUFP, [id://566453])

While trying to emulate the C preprocessor I had major trouble handling C-style /* ... */ comments. Two issues cause particular grief: comments can span lines and, at least for some compilers, comments can be nested (and they are in the code I need to handle).

An additional gotcha is that text inside a string literal that merely looks like a comment must be retained.

The code below parses an input string and generates an output string comprising the original text sans C-style comments. Note that it leaves C++ single-line (//) comments in place; they are easily dealt with in a second pass.

use strict;
use warnings;
use Parse::RecDescent;

my $decommendedText = '';

sub concat ($) {$decommendedText .= $_[0]; 1;}

my $decomment = <<'GRAMMAR';

file    :   block(s)

block   :   string  {::concat ($item{string}); 1}
        |   m{((?!/\*|"|').)+}s {::concat ($item[-1]); 1}
        |   comment {::concat ($item{comment}); 1;}

string  :   /"([^"]|\\")*"/
                {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
        |   /'([^']|\\')*'/
                {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}

comment :   '/*' commentBlock '*/'
                {$return = $text =~ /^\n/ ? "\n" : ''; 1;}

commentBlock :  m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ). +)*}sx
                {$return = "\n"; 1;}
        |   m{((?! \*/ | /\* ).)+}sx
                {$return = ''; 1;}

GRAMMAR

my $parse = new Parse::RecDescent ($decomment);

my $input = <<'DATA';
#include "StdAfx.h" // Tail comment
#include "Utility\perftime.h"
#pragma hdrstop
/* Comment before MACRO */
/* Comment /* and nested comment */ lines */
#define MACRO 10\
+ 3 // Multi line macro with comment
#define __DEBUG /* comment */ 1
#define STRING 'This is a string' /* comment */
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line
/* Comment at start for a number of lines */
/* multi-line comment /* nested */ block */
// cpp block
char PerfTimer::Buf[64];
DATA

$parse->file($input) or die "Parse failed\n";
print $decommendedText;

Prints:

#include "StdAfx.h"// Tail comment
#include "Utility\perftime.h"
#pragma hdrstop
#define MACRO 10\
+ 3 // Multi line macro with comment
#define __DEBUG 1
#define STRING 'This is a string'
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line
// cpp block
char PerfTimer::Buf[64];
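
The second pass mentioned above is not shown in the post. A minimal sketch of one way it might look, assuming the block comments have already been removed: the sub name strip_line_comments is hypothetical, and the approach (one substitution that matches quoted literals first so that a // inside them is kept) is a swapped-in technique, not part of the grammar above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove //-comments that are not inside quoted literals.
# Literals are matched first and put back unchanged, so a // inside a
# string survives. A comment is matched up to, but not including, the newline.
sub strip_line_comments {
    my ($text) = @_;
    $text =~ s{
        ( " (?: [^"\\\n]+ | \\. )* "    # double-quoted literal, kept
        | ' (?: [^'\\\n]+ | \\. )* '    # single-quoted literal, kept
        )
      | //[^\n]*                        # line comment, dropped
    }{ defined $1 ? $1 : '' }gsex;
    return $text;
}

print strip_line_comments(qq{char *u = "http://x"; // Don't keep this\n});
```

Because the scan runs left to right, an apostrophe inside a // comment is consumed as part of the comment before it can be mistaken for the start of a character literal.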

DWIM is Perl's answer to Gödel

Replies are listed 'Best First'.
Re: C comment stripping preprocessor (problems)
by tye (Sage) on Aug 09, 2006 at 19:20 UTC

    This doesn't handle this case:

    // We do not use /*-style comments

    It doesn't even handle the case:

    // We don't use old C-style comments

    because it tries to find the closing single quote to match the apostrophe in "don't". You simply have to parse //-style comments for such a tool.

    /"([^"]|\\")*"/

    This doesn't handle "\\". Also note that it will fail for strings of 32K characters, which is why I prefer to add the + in "([^"\\]+|\\.)*".
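
As a rough illustration of how the two string patterns differ, the sketch below runs both against one hand-picked literal in which an escaped quote is followed by comment-like text. The helper leading_match and the sample literal are assumptions made for this example; the suggested pattern is written with a non-capturing group.

```perl
use strict;
use warnings;

# Patterns quoted in the thread; the second uses a non-capturing group.
my $original  = qr/"([^"]|\\")*"/;
my $suggested = qr/"(?:[^"\\]+|\\.)*"/;

# Single-quoted heredoc so the backslash reaches the regex untouched.
chomp(my $literal = <<'C_SOURCE');
"a\"/*b*/"
C_SOURCE

# Hypothetical helper: return the text a pattern matches at the start.
sub leading_match {
    my ($pattern, $text) = @_;
    return $text =~ /\A($pattern)/ ? $1 : undef;
}

print "original : ", leading_match($original,  $literal), "\n";
print "suggested: ", leading_match($suggested, $literal), "\n";
```

In the original pattern the [^"] alternative also accepts a backslash, so the match stops at the escaped quote and the rest of the literal is left to be parsed as code and comment. The suggested pattern consumes each backslash together with the character after it and so reaches the real closing quote.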

    Why don't you factor out m{((?! \*/ | /\* ).)+}sx into its own rule so you don't have to repeat that regex three times and so you can assign a descriptive name to it to aid understanding?

    m{((?!/\*|"|').)+}s could be replaced by [^/"'] and /(?![*]), which is more to my taste but YMMV.
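
A small sketch of that replacement, assuming the two pieces are meant to be combined into a single alternation (the exact spelling in $explicit is a guess at the intent, and code_run is a hypothetical helper):

```perl
use strict;
use warnings;

# Pattern from the original grammar: any character that does not
# start a block comment or a quoted literal.
my $lookahead = qr{((?!/\*|"|').)+}s;

# One possible spelling of the suggested alternative (an assumption
# about how the two pieces were meant to be combined).
my $explicit = qr{(?:[^/"']+|/(?![*]))+};

# Hypothetical helper: text matched at the start of $text, or ''.
sub code_run {
    my ($pattern, $text) = @_;
    return $text =~ /\A($pattern)/ ? $1 : '';
}

my $sample = 'a / b /* c */ "d"';
print code_run($lookahead, $sample) eq code_run($explicit, $sample)
    ? "same\n" : "different\n";
```

Both patterns stop just before the /* in the sample, while the lone / used as a division operator is still consumed.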

    And I'd probably do this all with simpler regexes and a simple state machine instead of resorting to Parse::RecDescent (not that my result will be simpler code in total). Note that I even avoid having to slurp the entire input into a single string.

    #!/usr/bin/perl -w
    use strict;

    $|= 1;              # Useful for ad-hoc testing
    my $canNest= 1;     # Whether /*-style comments can be nested
    my $depth= 0;
    my $output= "";
    while( <DATA> ) {
        while( ! m[\G\z]gc ) {
            while( $depth && m[/[*]|[*]/]gc ) {
                if( "/" eq substr( $_, $-[0], 1 ) ) {
                    $depth++;
                } elsif( $canNest ) {
                    $depth--;
                } else {
                    $depth= 0;
                }
            }
            last if $depth;
            if( m[
                \G
                (?: [^'"/]+
                  | ' (?: [^'\\]+ | \\. )* '
                  | " (?: [^"\\]+ | \\. )* "
                  | /(?![/*])
                )+
            ]xgc ) {
                $output .= substr( $_, $-[0], $+[0] - $-[0] );
            } elsif( m[\G//.*]gc ) {
                # skip C++ comments
            } elsif( m[\G/[*]]gc ) {
                $depth++;
            } elsif( m[\G['"]]gc ) {
                warn "Ignoring unclosed quote: $_";
            } else {
                die $_, ' ' x pos($_), "^\nCouldn't be parsed";
            }
        }
        print $output;
        $output= "";
    }
    warn "$depth unclosed /*-comments\n" if $depth;
    __END__
    #include "StdAfx.h" // Tail comment
    #include "Utility\perftime.h"
    #pragma hdrstop
    /* Comment before MACRO */
    /* Comment /* and nested comment */ lines */
    #define MACRO 10\
    + 3 // Multi line macro with comment
    #define __DEBUG /* comment */ 1
    #define STRING 'This is a string' /* comment */
    #define BACKSLASH '\\'
    #define COMMENT "/* comment in \"a\" string */"
    // c++ comment line
    /* Comment at start for a number of lines */
    /* multi-line comment /* nested */ block */
    // cpp block
    char PerfTimer::Buf[64];
    // Don't use contractions
    // /*-style comment below over multiple lines:
    test/*ing how newlines work when a comment spans lines, does it st*/ing?
    total/*divide*//count//*comment

    Produces

    #include "StdAfx.h"
    #include "Utility\perftime.h"
    #pragma hdrstop
    #define MACRO 10\
    + 3
    #define __DEBUG 1
    #define STRING 'This is a string'
    #define BACKSLASH '\\'
    #define COMMENT "/* comment in \"a\" string */"
    char PerfTimer::Buf[64];
    testing?
    total/count

    (Very minor updates applied.)

    - tye        

Re: C comment stripping preprocessor
by ikegami (Patriarch) on Aug 09, 2006 at 18:34 UTC

    I have a few little improvements.

    • strict and warnings are specified for your program, but not for the code snippets in your grammar.
    • There is an undue requirement on the user (i.e. the calling script) to provide concat.
    • $decommendedText should not be global, and the user (i.e. the calling script) should not have to initialize it.
    • There is no check for end-of-file. If the parser fails halfway through, there's no way of knowing.

    The following addresses these issues.

    use strict;
    use warnings;
    use Parse::RecDescent;

    my $decomment = <<'GRAMMAR';

    {
        use strict;
        use warnings;

        sub concat { $decommendedText .= $_[0]; }
    }

    file    :   <rulevar: local $decommendedText = ''>
            |   block(s) /\Z/ {$return = $decommendedText; 1;}

    block   :   string  {concat ($item{string}); 1;}
            |   m{((?!/\*|"|').)+}s {concat ($item[-1]); 1;}
            |   comment {concat ($item{comment}); 1;}

    string  :   /"([^"]|\\")*"/
                    {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
            |   /'([^']|\\')*'/
                    {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}

    comment :   '/*' commentBlock '*/'
                    {$return = $text =~ /^\n/ ? "\n" : ''; 1;}

    commentBlock :  m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ). +)*}sx
                    {$return = "\n"; 1;}
            |   m{((?! \*/ | /\* ).)+}sx
                    {$return = ''; 1;}

    GRAMMAR

    ...

    my $decommendedText = $parse->file($input);
    die "Parse failed\n" if not defined $decommendedText;
    print $decommendedText;

    Update: Now fixes $decommendedText being a global.

      Actually concat is there:

      sub concat ($) {$decommendedText .= $_[0]; 1;}

      :)

      Update: Thanks for the other tips - especially the returned result from file and eof detection.


      DWIM is Perl's answer to Gödel
        It was in main::. In the caller. Why would a module rely on the calling script to provide its internal functions? That's no good. I moved the function into the module where it should be. The problem becomes extremely evident when you Precompile (as you should).
Re: C comment stripping preprocessor
by ForgotPasswordAgain (Priest) on Aug 10, 2006 at 10:32 UTC

    C comments don't nest!

    They don't lay eggs, either... :)

      It may well be that the comments that you are familiar with are sterile and have no inclination toward nesting. However, I can assure you, that our comments are quite lively and prone to nesting. They therefore require appropriate management.

      Perhaps I should mention that these are actually C++ comments and it may be that hybrid vigor accounts for the difference in nesting behaviour?
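
To make the difference between the two behaviours concrete, here is a minimal sketch of a depth counter that can be switched between nested and standard (non-nesting) treatment. The sub name strip_block_comments and its $can_nest flag are hypothetical, and string literals are deliberately ignored, so this is an illustration rather than a replacement for either program in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: strip /* ... */ comments from $text.
# $can_nest selects nested (true) or standard C (false) behaviour.
# String literals are deliberately ignored to keep the sketch short.
sub strip_block_comments {
    my ($text, $can_nest) = @_;
    my ($out, $depth) = ('', 0);
    # Split into comment delimiters and the text between them.
    for my $token (split m{(/\*|\*/)}, $text) {
        if ($token eq '/*') {
            # An opener inside a comment only counts when nesting is on.
            $depth++ if $depth == 0 || $can_nest;
        }
        elsif ($token eq '*/' && $depth > 0) {
            $depth = $can_nest ? $depth - 1 : 0;
        }
        elsif ($depth == 0) {
            $out .= $token;
        }
    }
    return $out;
}

my $sample = "a /* b /* c */ d */ e";
print "nested    : ", strip_block_comments($sample, 1), "\n";
print "non-nested: ", strip_block_comments($sample, 0), "\n";
```

With nesting enabled the whole span from the first /* to the last */ disappears. Without it the comment ends at the first */, leaving " d */ e" behind as code.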


      DWIM is Perl's answer to Gödel
