So folks,
today I need to make some sense of C++ files. I will need to parse out function signatures, and I have tried this with regex before; it gets messy, especially around templates. Now a similar requirement has reared its head, so step one: lex the code. Without further ado, here is my attempt at lexing C++. The Lexer is called with an open file handle to a C++ source file. The file is lexed into an array of tokens, which is then handed on to the parser.
What do you think: is this going to give me a nice labeled stream and make parsing a dream, or am I stumbling into known gotchas? Does it qualify as cool?
sub Lex {
    my $input = shift;   # Get file handle to a C++ file
    my @tokens;          # This will contain the tokenised file ready for our parser
    # Note: $debug is assumed to be declared at file scope.

    my @longPatterns = (
        ['Comment'   => qr|//.*|            ],
        ['Directive' => qr|^\s*#define.*|   ],
        ['Directive' => qr|^\s*#elif.*|     ],
        ['Directive' => qr|^\s*#else.*|     ],
        ['Directive' => qr|^\s*#error.*|    ],
        ['Directive' => qr|^\s*#endif.*|    ],
        ['Directive' => qr|^\s*#if.*|       ],
        ['Directive' => qr|^\s*#ifdef.*|    ],
        ['Directive' => qr|^\s*#ifndef.*|   ],
        ['Directive' => qr|^\s*#include.*|  ],
        ['Directive' => qr|^\s*#line.*|     ],
        ['Directive' => qr|^\s*#undef.*|    ],
        ['Directive' => qr|^\s*#pragma.*|   ],
    );

    my @reserved = qw(
        alignas alignof and and_eq asm atomic_cancel atomic_commit atomic_noexcept
        auto bitand bitor bool break case catch char char16_t char32_t class compl
        concept const constexpr const_cast continue co_await co_return co_yield
        decltype default delete do double dynamic_cast else enum explicit export
        extern false float for friend goto if import inline int long module mutable
        namespace new noexcept not not_eq nullptr operator or or_eq private
        protected public register reinterpret_cast requires return short signed
        sizeof static static_assert static_cast struct switch synchronized template
        this thread_local throw true try typedef typeid typename union unsigned
        using virtual void volatile wchar_t while xor xor_eq
    );

    my @patterns = (     # Multi character patterns to lex out
        ['Number'     => qr/^\d[\.\d]*$/   ],
        ['Identifier' => qr/\w+/           ],
        ['dblColon'   => qr/(?<!:)::(?!:)/ ],
    );

    my %Character = (    # Single characters by name
        '(' => 'LeftParen',
        ')' => 'RightParen',
        '[' => 'LeftSquare',
        ']' => 'RightSquare',
        '{' => 'LeftCurly',
        '}' => 'RightCurly',
        '<' => 'LessThan',
        '>' => 'GreaterThan',
        '=' => 'Equal',
        '+' => 'Plus',
        '-' => 'Minus',
        '*' => 'Asterisk',
        '/' => 'Slash',
        '#' => 'Hash',
        '.' => 'Dot',
        ',' => 'Comma',
        ':' => 'Colon',
        ';' => 'Semicolon',
        "'" => 'SingleQuote',
        '"' => 'DoubleQuote',
        '|' => 'Pipe',
    );

    while (my $line = <$input>) {
        chomp $line;
        my $matched;
        for my $patt (@longPatterns) {   # some to evaluate on the entire line
            if ($line =~ s|($patt->[1])|| ) {
                my $token = $1;
                print "$patt->[0]\t$token\n" if $debug;
                push @tokens, [$patt->[0], $token];
            }
        }
        print "got> $line\n" if $debug and $line =~ /\S/;
        LABEL: for my $token (split /\b/, $line) {   # now handle token at a time
            $token =~ s/^\s+|\s+$//g;    # Strip whitespace
            next unless length $token;   # anything left? (length, so a literal "0" is not skipped)
            print "Lexing $token\n" if $debug;
            for my $word (@reserved) {   # look for reserve words
                if ($word eq $token) {   # A C++ reserve word, simples
                    print "reserved\t$token\n" if $debug;
                    push @tokens, ['reserved', $token];
                    next LABEL;
                }
            }
            for my $pat (@patterns) {    # Try multi character patterns next
                if ($token =~ /$pat->[1]/) {
                    print "$pat->[0]\t$token\n" if $debug;
                    push @tokens, [$pat->[0], $token];
                    next LABEL;
                }
            }
            unless ($matched) {   # Didn't match multichar pattern, so handle character at a time
                for my $char (split //, $token) {
                    print "Lexing by character $char\n" if $debug;
                    if (exists $Character{$char}) {
                        print "$Character{$char}\t$char\n" if $debug;
                        push @tokens, [$Character{$char}, $char];
                    }
                    else {
                        print "Failed to match $char\n";
                    }
                }
            }
        }
    }
    Parser(\@tokens);   # hand the complete token array to the parser once the whole file is lexed
}
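Some inputs may trip up the line-by-line `split /\b/` approach. A literal such as `3.14` is split into three tokens, a `//` inside a string literal is taken for a comment, and `/* ... */` block comments are lexed character by character. One possible alternative is a single pass over the whole source with `\G`-anchored regexes, trying rules in order at the current position. The following is only a minimal sketch under assumptions: `lex_string`, `@rules`, the `Punct`, `String` and `Char` labels, and the abbreviated `%reserved` hash are hypothetical names, not part of the code above. It does not handle hex literals, numeric suffixes, raw strings or multi-character operators other than `::`.

```perl
use strict;
use warnings;

# Abbreviated, illustrative keyword set; the full list would go here.
my %reserved = map { $_ => 1 } qw(int return const class template);

# Hypothetical rule table: tried in order at the current position,
# first match wins. Comments and strings come first so that their
# contents are never re-examined by later rules.
my @rules = (
    [ Comment    => qr{//[^\n]*|/\*.*?\*/}s   ],
    [ String     => qr{"(?:[^"\\\n]|\\.)*"}   ],
    [ Char       => qr{'(?:[^'\\\n]|\\.)+'}   ],
    [ Number     => qr{\d+(?:\.\d+)?}         ],
    [ Identifier => qr{[A-Za-z_]\w*}          ],
    [ dblColon   => qr{::}                    ],
    [ Punct      => qr{[^\s\w]}               ],
);

# Tokenise a whole source string; returns a reference to an array
# of [label, text] pairs, the same shape the original lexer builds.
sub lex_string {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    TOKEN: while (pos($src) < length($src)) {
        next TOKEN if $src =~ /\G\s+/gc;          # skip whitespace
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            if ($src =~ /\G($re)/gc) {
                my $text  = $1;
                my $label = ($name eq 'Identifier' && $reserved{$text})
                          ? 'reserved' : $name;
                push @tokens, [ $label, $text ];
                next TOKEN;
            }
        }
        # Nothing matched: report the character and step over it.
        warn "Failed to match '" . substr($src, pos($src), 1) . "'\n";
        pos($src) = pos($src) + 1;
    }
    return \@tokens;
}
```

To use it with a file handle as in the original, the whole file could be slurped first, for example with `my $src = do { local $/; <$input> };`, and the result of `lex_string($src)` passed on to the parser.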
Cheers,
R.
Pereant, qui ante nos nostra dixerunt!
Update
More compiler directives added
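As the directive list grows, the per-directive patterns could perhaps be folded into one regex built from a word list. This is a sketch only: `@directives`, `$directive_re` and `directive_name` are hypothetical names. It also allows whitespace between `#` and the directive name, which the preprocessor accepts, and uses `\b` so that `#if` does not also match a longer word starting with `if`.

```perl
use strict;
use warnings;

# Hypothetical consolidation of the directive patterns into one regex.
my @directives = qw(
    define elif else error endif if ifdef ifndef include line undef pragma
);
my $alt          = join '|', @directives;
my $directive_re = qr/^\s*#\s*($alt)\b.*/;

# Returns the directive name if the line is a preprocessor directive,
# or undef otherwise.
sub directive_name {
    my ($line) = @_;
    return $line =~ $directive_re ? $1 : undef;
}
```

Adding a new directive then means adding one word to `@directives` instead of another pattern entry.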
Replies are listed 'Best First'.
Re: Lexing C++
by tybalt89 (Monsignor) on Sep 01, 2019 at 16:33 UTC
by afoken (Chancellor) on Sep 02, 2019 at 09:19 UTC
by tybalt89 (Monsignor) on Sep 02, 2019 at 11:18 UTC
Re: Lexing C++
by Eily (Monsignor) on Sep 02, 2019 at 10:39 UTC
by Random_Walk (Prior) on Sep 02, 2019 at 19:51 UTC
by tybalt89 (Monsignor) on Sep 02, 2019 at 21:18 UTC
Re: Lexing C++
by kikuchiyo (Hermit) on Sep 02, 2019 at 21:24 UTC
by Random_Walk (Prior) on Sep 03, 2019 at 17:59 UTC