Re: Lexing C++

am I stumbling into known gotchas?

I think so. What you get might not be C++ code yet, but preprocessor input. Since the preprocessor can add quotes, concatenate tokens, and remove or duplicate code, it may be too early to interpret the data as C++. For example, my_variable(23) could be turned into "var23="<<var23 by the preprocessor. Your tokenizer would then detect an identifier, two parentheses and a number, when the C++ code actually contains a string, a '<<' token (note that << is a single token, not two < in a row) and an identifier.
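To make that concrete, here is a minimal sketch of a macro that would produce exactly this kind of rewrite. The macro body and the helper describe_var23 are my own assumptions for illustration, not something taken from the code being lexed:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical macro: #n stringifies the argument and var##n pastes it
// onto "var", all before the compiler proper sees any C++ tokens.
#define my_variable(n) "var" #n "=" << var##n

// Hypothetical helper that returns what `out << my_variable(23)` prints.
std::string describe_var23() {
    int var23 = 7;
    std::ostringstream out;
    // Expands to: out << "var" "23" "=" << var23;
    // Adjacent string literals are then merged into "var23=".
    out << my_variable(23);
    return out.str();
}
```

A tokenizer run over the unexpanded source sees my_variable, (, 23 and ), while the compiler sees a string literal, << and the identifier var23.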

Fortunately, C++ style guidance generally discourages this kind of rewriting through the preprocessor and favours templates instead. So if those recommendations are followed in the code you are reading, you may be safe trying to do whatever it is you are trying to do.

Another issue: you tokenize neither strings nor multiline comments, which would lead to a lot of things being interpreted wrongly. E.g.:

/*
#define This is not a directive because inside a comment
*/
" // This is not a comment because inside a string ";
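One possible way to keep track of this is a small state machine that remembers whether it is in code, in a string, or in a comment. The following is only a sketch under my own assumptions: the name strip_comments is hypothetical, and it deliberately ignores character literals and raw string literals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns `src` with every comment replaced by a
// single space, leaving double-quoted string literals untouched.
// Assumption: no character literals ('"') and no raw strings in `src`.
std::string strip_comments(const std::string& src) {
    enum class State { Code, String, LineComment, BlockComment };
    State state = State::Code;
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        char next = i + 1 < src.size() ? src[i + 1] : '\0';
        switch (state) {
        case State::Code:
            if (c == '"') { state = State::String; out += c; }
            else if (c == '/' && next == '*') { state = State::BlockComment; out += ' '; ++i; }
            else if (c == '/' && next == '/') { state = State::LineComment; out += ' '; ++i; }
            else out += c;
            break;
        case State::String:
            out += c;
            if (c == '\\' && i + 1 < src.size()) { out += next; ++i; }  // keep escaped char
            else if (c == '"') state = State::Code;
            break;
        case State::LineComment:
            if (c == '\n') { state = State::Code; out += c; }  // newline ends the comment
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Code; ++i; }
            break;
        }
    }
    return out;
}
```

With this, the #define inside the block comment disappears, while the // inside the string survives as part of the string.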

Re^2: Lexing C++ by Random_Walk (Prior) on Sep 02, 2019 at 19:51 UTC
Hi Eily,

Thanks for your comments, very useful. I know the code I am looking at makes use of C++ templates quite a lot, so hopefully there are no evil preprocessor tricks to trip me up.

For multi-line comments, strings and such, I was going to rely on my parser using a state machine so it knows when it is inside such a beast. In fact this sort of thing is why I am trying to lex and parse. I had a previous solution mostly based on regexes, but handling multi-line strings, comments etc. was one of the issues that was making it hairy and unmaintainable.

I certainly need to improve my lexer to handle '<<' and friends; the previous comment also highlighted that fact. I think I need to break on \s+ first to keep identifiers together, then perhaps on \b to get multi-character tokens, and finally per character to get the }); sort of fun. An update is in the works...

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!
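For '<<' and friends, one common approach is longest match: at each position try the longest operator first and fall back to a single character. The following is a rough sketch only; split_tokens and its operator list are hypothetical and far from the full C++ operator set.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `src` into words (identifiers or numbers)
// and operators, preferring the longest operator at each position.
std::vector<std::string> split_tokens(const std::string& src) {
    // Listed longest first so "<<=" wins over "<<", and "<<" over "<".
    static const std::vector<std::string> ops = {
        "<<=", ">>=", "<<", ">>", "<=", ">=", "==", "!=", "&&", "||",
        "+=", "-=", "*=", "/=", "::", "->"};
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {  // identifier or number
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            tokens.push_back(src.substr(start, i - start));
            continue;
        }
        bool matched = false;
        for (const std::string& op : ops) {
            if (src.compare(i, op.size(), op) == 0) {  // operator starts here
                tokens.push_back(op);
                i += op.size();
                matched = true;
                break;
            }
        }
        if (!matched) tokens.push_back(std::string(1, src[i++]));  // single character
    }
    return tokens;
}
```

Note that longest match also turns the closing >> of nested template arguments into one token, so a parser built on top of it has to be prepared to split that token again.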
Re^3: Lexing C++ by tybalt89 (Monsignor) on Sep 02, 2019 at 21:18 UTC
Here's a slightly modified version (still incomplete) of mine that handles several of the problems you mention, like multi-line comments and strings (updated per afoken Re^2: Lexing C++). It also adds the character position of the token as the third item, so the parser can generate better error messages :)

#!/usr/bin/perl

# https://perlmonks.org/?node_id=11105353
# following spirit of my http://www.rosettacode.org/wiki/Compiler/lexical_analyzer#Alternate_Perl_Solution

use strict;
use warnings;

my @tokens;
my %reserved = map { $_ => 'reserved' } qw(
  alignas alignof and and_eq asm atomic_cancel atomic_commit atomic_noexcept
  auto bitand bitor bool break case catch char char16_t char32_t class compl
  concept const constexpr const_cast continue co_await co_return co_yield
  decltype default delete do double dynamic_cast else enum explicit export
  extern false float for friend goto if import inline int long module mutable
  namespace new noexcept not not_eq nullptr operator or or_eq private
  protected public register reinterpret_cast requires return short signed
  sizeof static static_assert static_cast struct switch synchronized template
  this thread_local throw true try typedef typeid typename union unsigned
  using virtual void volatile wchar_t while xor xor_eq );

my %Ops = (             # Single or multiple operators by name
  '(' => 'LeftParen',
  ')' => 'RightParen',
  '[' => 'LeftSquare',
  ']' => 'RightSquare',
  '{' => 'LeftCurly',
  '}' => 'RightCurly',
  '<' => 'LessThan',
  '>' => 'GreaterThan',
  '=' => 'Equal',
  '+' => 'Plus',
  '-' => 'Minus',
  '*' => 'Asterisk',
  '/' => 'Slash',
  '#' => 'Hash',
  '.' => 'Dot',
  ',' => 'Comma',
  ':' => 'Colon',
  ';' => 'Semicolon',
  "'" => 'SingleQuote',
  '"' => 'DoubleQuote',
  '|' => 'Pipe',
  '>>' => 'RightShift', # remember to sort by longest first
  '<<' => 'LeftShift',
  '<=' => 'LessThanOrEqual',
  '>=' => 'GreaterThanOrEqual',
  '||' => 'LogicalOr',
  '&&' => 'LogicalAnd',
  '+=' => 'PlusEqual',
  '-=' => 'MinusEqual',
  '*=' => 'TimesEqual',
  '/=' => 'DivideEqual',
  );

my $matchops = qr/(?:@{[ join '|', map quotemeta,
  sort { length $b <=> length $a }  # longest first
  sort keys %Ops ]})/;

my $regex = qr/ \G (?|
    \s+                   (?{ undef })
  | \/\/.*                (?{ undef })
  | \/\*[\s\S]*?\*\/      (?{ undef }) # assuming non-nested
  | \#(.+)                (?{ [ 'Directive', $1 ] })
  | \d+(?:\.\d*)?         (?{ 'Number' })
  | \.\d+                 (?{ 'Number' })
  | \w+                   (?{ $reserved{$&} or 'Identifier' })
  | "((?:\\.|[^\\\n"])*)" (?{ [ 'string', $1 =~ s!\\(.)!$1!gr ] })
  | '([^\\'\n])'          (?{ [ 'Number', ord $1 ] })
  | (?<!:)::(?!:)         (?{ 'dblColon' })
  | $matchops             (?{ $Ops{$&} })
  | .                     (?{ 'ERROR: unexpected character' })
  ) /x;

$_ = (join '', <DATA>) =~ s/\\\n/ /gr;
defined $^R and push @tokens, [ ref $^R ? @{$^R} : ( $^R, $& ), $-[0] ]
  while /$regex/g;

use Data::Dump 'dd'; dd @tokens;

__DATA__
#define TheAnswerToLifeTheUniverseAndEverything \
	(42)
int main(int argc, char *argv[ ])
 {
 int $foo = 1 << 5;
 /* multiline
  comment */
 puts("testing a \"quoted\" string with $ sign");
 exit(0); // success
 }

Outputs:

(
  [
    "Directive",
    "define TheAnswerToLifeTheUniverseAndEverything  \t(42)",
    0,
  ],
  ["reserved", "int", 55],
  ["Identifier", "main", 59],
  ["LeftParen", "(", 63],
  ["reserved", "int", 64],
  ["Identifier", "argc", 68],
  ["Comma", ",", 72],
  ["reserved", "char", 74],
  ["Asterisk", "*", 79],
  ["Identifier", "argv", 80],
  ["LeftSquare", "[", 84],
  ["RightSquare", "]", 86],
  ["RightParen", ")", 87],
  ["LeftCurly", "{", 90],
  ["reserved", "int", 93],
  ["ERROR: unexpected character", "\$", 97],
  ["Identifier", "foo", 98],
  ["Equal", "=", 102],
  ["Number", 1, 104],
  ["LeftShift", "<<", 106],
  ["Number", 5, 109],
  ["Semicolon", ";", 110],
  ["Identifier", "puts", 140],
  ["LeftParen", "(", 144],
  ["string", "testing a \"quoted\" string with \$ sign", 145],
  ["RightParen", ")", 186],
  ["Semicolon", ";", 187],
  ["Identifier", "exit", 190],
  ["LeftParen", "(", 194],
  ["Number", 0, 195],
  ["RightParen", ")", 196],
  ["Semicolon", ";", 197],
  ["RightCurly", "}", 211],
)

