http://qs321.pair.com?node_id=11105353

So folks,

today I need to makes some sense of C++ files. I will need to parse out function signatures, and I have tried this with regex before, it gets messy especially around templates. Now a similar requirement has reared it's head so step one, lex the code. without further ado here is my attempt at lexing C++. The Lexer is called with an open file handle to a C++ source file. This is lexed into an array of tokens, that is then handed on to the parser.

What do you think, is this going to give me a nice labeled stream and make parsing a dream, or am I stumbling into know gotchas? Does it qualify as cool?

sub Lex { my $input = shift; # Get file handle to a C++ file my @tokens; # This will contain the tokenised file ready fo +r our parser my @longPatterns = ( ['Comment' => qr|//.*| ], ['Directive' => qr|^\s*#define.*| ], ['Directive' => qr|^\s*#elif.*| ], ['Directive' => qr|^\s*#else.*| ], ['Directive' => qr|^\s*#error.*| ], ['Directive' => qr|^\s*#endif.*| ], ['Directive' => qr|^\s*#if.*| ], ['Directive' => qr|^\s*#ifdef.*| ], ['Directive' => qr|^\s*#ifndef.*| ], ['Directive' => qr|^\s*#include.*| ], ['Directive' => qr|^\s*#line.*| ], ['Directive' => qr|^\s*#undef.*| ], ['Directive' => qr|^\s*#pragma.*| ], ); my @reserved = qw( alignas alignof and and_eq asm atomic_cancel atomic_commit ato +mic_noexcept auto bitand bitor bool break case catch char char16_t char32_t class compl conce +pt const constexpr const_cast continue co_await co_return co_yield decltype default delete d +o double dynamic_cast else enum explicit export extern false float for friend goto if imp +ort inline int long module mutable namespace new noexcept not not_eq nullptr operator or +or_eq private protected public register reinterpret_cast requires return short signed sizeof +static static_assert static_cast struct switch synchronized template this thread_lo +cal throw true try typedef typeid typename union unsigned using virtual void volatile wch +ar_t while xor xor_eq ); my @patterns = ( # Multi character patterns to lex out ['Number' => qr/^\d[\.\d]*$/ ], ['Identifier' => qr/\w+/ ], ['dblColon' => qr/(?<!:)::(?!:)/ ], ); my %Character = ( # Single characters by name '(' => 'LeftParen', ')' => 'RightParen', '[' => 'LeftSquare', ']' => 'RightSquare', '{' => 'LeftCurly', '}' => 'RightCurly', '<' => 'LessThan', '>' => 'GreaterThan', '=' => 'Equal', '+' => 'Plus', '-' => 'Minus', '*' => 'Asterisk', '/' => 'Slash', '#' => 'Hash', '.' => 'Dot', ',' => 'Comma', ':' => 'Colon', ';' => 'Semicolon', "'" => 'SingleQuote', '"' => 'DoubleQuote', '|' => 'Pipe', ); while (my $line = <$input>) { chomp $line; my $matched; for my $patt (@longPatterns) { # some to evaluate on the entire li +ne if ($line =~ s|($patt->[1])|| ) { my $token = $1; print "$patt->[0]\t$token\n" if $debug; push @tokens, [$patt->[0], $token]; } } print "got> $line\n" if $debug and $line =~/\S/; LABEL: for my $token (split /\b/, $line) { # now handle token at a + time $token =~ s/^\s+|\s+$//g; # Strip whitespace next unless $token; # anything left? print "Lexing $token\n" if $debug; for my $word (@reserved) { # look for reserve words if ($word eq $token) { # A C++ reserve word, simples print "reserved\t$token\n" if $debug; push @tokens, ['reserved', $token]; next LABEL; } } for my $pat (@patterns) { # Try multi character patterns n +ext if ($token =~ /$pat->[1]/) { print "$pat->[0]\t$token\n" if $debug; push @tokens, [$pat->[0], $token]; next LABEL } } unless ($matched) { # Didn't match multichar pattern, so h +andle character at a time for my $char (split //, $token) { print "Lexing by character $char\n" if $debug; if (exists $Character{$char}) { print "$Character{$char}\t$char\n" if $debug; push @tokens, [$Character{$char}, $char]; } else { print "Failed to match $char\n"; } } } } Parser(\@tokens) } }

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!

Update

More compiler directives added

Replies are listed 'Best First'.
Re: Lexing C++
by tybalt89 (Monsignor) on Sep 01, 2019 at 16:33 UTC

    I like doing this kind of thing all in one regex. This is incomplete, but should be enough to give you the basic idea.

    #!/usr/bin/perl # https://perlmonks.org/?node_id=11105353 # following spirit of my http://www.rosettacode.org/wiki/Compiler/lexi +cal_analyzer#Alternate_Perl_Solution use strict; use warnings; my @tokens; my %reserved = map { $_ => 1 } qw( alignas alignof and and_eq asm atomic_cancel atomic_commit atomic_noexcept auto bitand bitor bool break case catch char char16_t char32_t class compl concept const constexpr const_cast continue co_await co_return co_yield decltype default delete do double dynamic_cast else enum explicit export extern false float for friend goto if import inline int long module mutable namespace new noexcept not not_eq nullptr operator or or_eq private protected public register reinterpret_cast requires return short signed sizeof static static_assert static_cast struct switch synchronized template this thread_local throw true try typedef typeid typename union unsigned using virtual void volatile wchar_t while xor xor_eq ); my %Character = ( # Single characters by name '(' => 'LeftParen', ')' => 'RightParen', '[' => 'LeftSquare', ']' => 'RightSquare', '{' => 'LeftCurly', '}' => 'RightCurly', '<' => 'LessThan', '>' => 'GreaterThan', '=' => 'Equal', '+' => 'Plus', '-' => 'Minus', '*' => 'Asterisk', '/' => 'Slash', '#' => 'Hash', '.' => 'Dot', ',' => 'Comma', ':' => 'Colon', ';' => 'Semicolon', "'" => 'SingleQuote', '"' => 'DoubleQuote', '|' => 'Pipe', ); my %MultiOps = ( '>>' => 'RightShift', '<<' => 'LeftShift', '<=' => 'LessThanOrEqual', ); my $regex = qr/ \G (?| \s+ (?{ undef }) | \/\/.* (?{ undef }) | \d+(?:\.\d*)? (?{ 'Number' }) | \.\d+ (?{ 'Number' }) | \w+ (?{ $reserved{$&} ? 'reserved' : 'Identifier' }) | "([^"]*)" (?{ [ 'string', $1 ] }) | (?<!:)::(?!:) (?{ 'dblColon' }) | (?:<<|>>|<=) (?{ $MultiOps{$&} }) | . (?{ $Character{$&} or 'character' }) ) /x; $_ = join '', <DATA>; defined $^R and push @tokens, ref $^R ? $^R : [ $^R, $& ] while /$rege +x/gc; use Data::Dump 'dd'; dd @tokens; __DATA__ // comment should not appear main(void) { int foo = 1 << 5; puts("testing"); exit(0); // done }

    Outputs:

    ( ["Identifier", "main"], ["LeftParen", "("], ["reserved", "void"], ["RightParen", ")"], ["LeftCurly", "{"], ["reserved", "int"], ["Identifier", "foo"], ["Equal", "="], ["Number", 1], ["LeftShift", "<<"], ["Number", 5], ["Semicolon", ";"], ["Identifier", "puts"], ["LeftParen", "("], ["string", "testing"], ["RightParen", ")"], ["Semicolon", ";"], ["Identifier", "exit"], ["LeftParen", "("], ["Number", 0], ["RightParen", ")"], ["Semicolon", ";"], ["RightCurly", "}"], )

    At least it doesn't fail any of your provided test cases :)

      At least it doesn't fail any of your provided test cases :)

      But a single escaped double quote in the example is sufficient to make it return garbage.

      // comment should not appear main(void) { int foo = 1 << 5; puts("testing \"quoted\" string"); exit(0); // done }
      ... [ 'Identifier', 'puts' ], [ 'LeftParen', '(' ], [ 'string', 'testing \\' ], [ 'Identifier', 'quoted' ], [ 'character', '\\' ], [ 'string', ' string' ], [ 'RightParen', ')' ], [ 'Semicolon', ';' ], ...

      There is no identifier named quoted in the example C/C++ code. quoted is part of a string passed to puts().

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Sure.

        I did say it is incomplete. That's a part of it. Also many more multi-character ops are missing, such as == != += -= ++ -- etc.

        It is true that no test case with quoted double quotes was provided :)

        My C++ is very rusty, I'm sire there are many more lexical elements that are missing.

Re: Lexing C++
by Eily (Monsignor) on Sep 02, 2019 at 10:39 UTC

    am I stumbling into know gotchas?
    I think so. What you get might not be C++ code yet, but preprocessor input. Since the preprocessor may add quotes, concatenate some elements, and remove or duplicate code, it might be too early to interpret the data as C++ code. For example my_variable(23) could be turned into "var23="<<var23 by the preprocessor. So your tokenizer would detect and identifier, two parentheses and a number, when there is actually a string a '<<' token (note that << is a single token, not two < in a row) and an identifier in the C++ code.

    Hopefully, C++ actually discourages the use of this kind of changes through the preprocessor, and favours the use of Templates instead. So if those recommendation are enforced, you may be safe trying to do whatever it is you are trying to do.

    Another issue: you neither tokenize strings nor multiline comments, this would lead to a lot of things being interpreted wrongly. Eg:

    /* #define This is not a directive because inside a comment */ " // This is not a comment because inside a string ";

      Hi Eily,

      Thanks for your comments, very useful. I know the code I am looking at makes use of C templates quite a lot, so hopefully there are no evil compiler tricks to trip me up.

      For multi-line comments, strings and such, I was going to rely on my parser using a state machine so it knows when it is inside such a beast. In fact this sort of thing is why I am trying to lex and parse. I had a previous solution mostly based on regex, but handling multi-line strings, comments etc, was one of the issues that was making it hairy and un-maintainable.

      I certainly need to improve my lexer to handle '<<' and friends, the previous comment also highlighted that fact. I think I need to break on \s+ first to better get identifiers all together, then perhaps \b to get multi character tokens, finally per character to get the }); sort of fun. An update is in the works...

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!

        Here's a slightly modified version (still incomplete) of mine that handles several of the problems you mention, like multi-line comments and strings (updated per afoken Re^2: Lexing C++). It also adds the character position of the token as the third item, so the parser can generate better error messages :)

        #!/usr/bin/perl # https://perlmonks.org/?node_id=11105353 # following spirit of my http://www.rosettacode.org/wiki/Compiler/lexi +cal_analyzer#Alternate_Perl_Solution use strict; use warnings; my @tokens; my %reserved = map { $_ => 'reserved' } qw( alignas alignof and and_eq asm atomic_cancel atomic_commit atomic_noexcept auto bitand bitor bool break case catch char char16_t char32_t class compl concept const constexpr const_cast continue co_await co_return co_yield decltype default delete do double dynamic_cast else enum explicit export extern false float for friend goto if import inline int long module mutable namespace new noexcept not not_eq nullptr operator or or_eq private protected public register reinterpret_cast requires return short signed sizeof static static_assert static_cast struct switch synchronized template this thread_local throw true try typedef typeid typename union unsigned using virtual void volatile wchar_t while xor xor_eq ); my %Ops = ( # Single or multiple operators by name '(' => 'LeftParen', ')' => 'RightParen', '[' => 'LeftSquare', ']' => 'RightSquare', '{' => 'LeftCurly', '}' => 'RightCurly', '<' => 'LessThan', '>' => 'GreaterThan', '=' => 'Equal', '+' => 'Plus', '-' => 'Minus', '*' => 'Asterisk', '/' => 'Slash', '#' => 'Hash', '.' => 'Dot', ',' => 'Comma', ':' => 'Colon', ';' => 'Semicolon', "'" => 'SingleQuote', '"' => 'DoubleQuote', '|' => 'Pipe', '>>' => 'RightShift', # remember to sort by longest first '<<' => 'LeftShift', '<=' => 'LessThanOrEqual', '>=' => 'GreaterThanOrEqual', '||' => 'LogicalOr', '&&' => 'LogicalAnd', '+=' => 'PlusEqual', '-=' => 'MinusEqual', '*=' => 'TimesEqual', '/=' => 'DivideEqual', ); my $matchops = qr/(?:@{[ join '|', map quotemeta, sort { length $b <=> length $a } # longest first sort keys %Ops ]})/; my $regex = qr/ \G (?| \s+ (?{ undef }) | \/\/.* (?{ undef }) | \/\*[\s\S]*?\*\/ (?{ undef }) # assuming non-nested | \#(.+) (?{ [ 'Directive', $1 ] }) | \d+(?:\.\d*)? (?{ 'Number' }) | \.\d+ (?{ 'Number' }) | \w+ (?{ $reserved{$&} or 'Identifier' }) | "((?:\\.|[^\\\n"])*)" (?{ [ 'string', $1 =~ s!\\(.)!$1!gr ] }) | '([^\\'\n])' (?{ [ 'Number', ord $1 ] }) | (?<!:)::(?!:) (?{ 'dblColon' }) | $matchops (?{ $Ops{$&} }) | . (?{ 'ERROR: unexpected character' }) ) /x; $_ = (join '', <DATA>) =~ s/\\\n/ /gr; defined $^R and push @tokens, [ ref $^R ? @{$^R} : ( $^R, $& ), $-[0] +] while /$regex/g; use Data::Dump 'dd'; dd @tokens; __DATA__ #define TheAnswerToLifeTheUniverseAndEverything \ (42) int main(int argc, char *argv[ ]) { int $foo = 1 << 5; /* multiline comment */ puts("testing a \"quoted\" string with $ sign"); exit(0); // success }

        Outputs:

        ( [ "Directive", "define TheAnswerToLifeTheUniverseAndEverything \t(42)", 0, ], ["reserved", "int", 55], ["Identifier", "main", 59], ["LeftParen", "(", 63], ["reserved", "int", 64], ["Identifier", "argc", 68], ["Comma", ",", 72], ["reserved", "char", 74], ["Asterisk", "*", 79], ["Identifier", "argv", 80], ["LeftSquare", "[", 84], ["RightSquare", "]", 86], ["RightParen", ")", 87], ["LeftCurly", "{", 90], ["reserved", "int", 93], ["ERROR: unexpected character", "\$", 97], ["Identifier", "foo", 98], ["Equal", "=", 102], ["Number", 1, 104], ["LeftShift", "<<", 106], ["Number", 5, 109], ["Semicolon", ";", 110], ["Identifier", "puts", 140], ["LeftParen", "(", 144], ["string", "testing a \"quoted\" string with \$ sign", 145], ["RightParen", ")", 186], ["Semicolon", ";", 187], ["Identifier", "exit", 190], ["LeftParen", "(", 194], ["Number", 0, 195], ["RightParen", ")", 196], ["Semicolon", ";", 197], ["RightCurly", "}", 211], )
Re: Lexing C++
by kikuchiyo (Hermit) on Sep 02, 2019 at 21:24 UTC

    I think this approach is bound to fail.

    C++ is a very complex language; e.g. here is an example program (scroll down to the first comment) that's syntactically valid if and only if the integer in the main function is a prime.

    Writing a full C++ parser would be a large undertaking, especially if you start with a simple approach like the ones above, and try to add support for new language features gradually. It would be an uphill battle. At some point you'd have to stop and live with the risk that for certain inputs your incomplete parser would silently produce broken output.

    So why not use the one program that is known to be able to parse C++: a C++ compiler?

    Clang, for example, has options to dump the tokens it encounters:

    clang foo.cpp -Xclang -dump-tokens clang foo.cpp -Xclang -dump-raw-tokens

    or the full AST:

    clang foo.cpp -Xclang -ast-dump

    Parsing or make sense of the latter is still not trivial, but not as hopeless as writing a full C++ parser on your own.

      kikuchiyo++

      I am actually 'only' after parsing out some details of the function signatures, some info on variable types and classes. Parsing the code to the level of execution is not my goal. Having said that I like the idea of handing it to an existing parser and getting the AST back. And I just found there is a package for Clang in the standard Ubuntu repos.

      Thanks!

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!