Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Operator Precedence Parser

by bart (Canon)
on Jun 09, 2006 at 15:39 UTC ( [id://554516]=perlmeditation: print w/replies, xml ) Need Help??

A few days ago, Limbic~Region was eagerly hunting all over the place for every bit of information he could find on parsing, in particular, on Yacc-style and recursive descent parsers. He was doubting whether there is any use in trying to come up with an implementation idea of your own, and write your own parser, when such decent toolkits are readily available.

I suddenly recalled a parser that I wrote myself, in the days before I had any formal education about compilers. When later I looked into the Dragon Book, I was rather amazed that the system I had come up with, was not described. There is someting that comes somewhat in the neighbourhood, called "Operator Precedence Parsing" (section 4.6). But, at least, I do not have to invent any new weird math symbols in order to explain how it works. :)

Its purpose is to parse and calculate infix expressions, stuff like "2+3*(4+5)". As is the case with recursive descent parsing, it is easy to implement by hand — actually, easier; in fact it's so easy that I had no trouble at all implementing it in pure Z80 assembler. But, in contrast with recursive descent parsing, precedence (calculation order) rules for the various operators are defined by a precedence table: each operator has a numerical precedence value (the higher the value, the tighter it binds) and an associativity type (left or right). As a result, new infix operators can easily be added without changing any code, but by merely adding their entries to the operators table.

The algorithm

The grammar it parses has the basic form:

expr := value ( op value )*
where value is something like a number or a variable's name, and op is an infix operator.

The code layout is a bit different from the above spec, and in pseudocode, looks like this:

sub parse_expr { my($value, @stack); while(1) { $value = parse_value or die "Parse error"; my $op = parse_operator or last; $value = process(\@stack, $value, $op); # intermediate proces +sing push @stack, { value => $value, op => $op }; # store for the +next round } $value = process(\@stack, $value); # final processing return $value; }

process() uses a stack to hold intermediate values and operators. In addition, there are a few variables holding the last value that got parsed or calculated ($value), and the last operator that was just seen ($op).

Lexing (tokenizing) the data happens in 2 independent lexers: one for the core values, and one for the operators. Which lexer is used depends on whether an operator or a value is expected. That way you can reuse the same symbol for different roles without conflict, such as using "-" for a subtraction operator, and as an unary minus.

If a value is expected, either at the start of an expression, or following an infix operator, and none is recognized, a parsing error is raised.

If an operator is expected and none is seen, it is merely assumed that the end of the expression has been reached, and the parser returns the calculated value, after completing any postponed operations. The input pointer remains where it is, and you can continue parsing using any other means, where it stopped.

Processing happens like this: the precedence of the last recognized operator is compared to the precedence of the operator on top of the stack. If it's higher, the current value and operator are pushed onto the stack and the next item will be parsed first; if it's lower, the previous postponed calculation will be performed first, possibly repeated for any older values and operators on the stack. Finally, the final result, and the operator, are pushed onto the stack.

You can compare this to the LALR (Yacc style) parsers, where people talk about shift (= push) and reduce (= calculate).

If their precedence is the same, the associativity of the operator (on the stack) determines whether a calculation will be performed first (left), or postponed (right). I strongly recommend against ever using different types associativity for the same precedence level, so it doesn't matter which operator's associativity you use.

That's it. That's the whole parser.

How about nested parens? Simple, use the parsing rule

value := '(' expr ')'

That is, if you recognize an opening paren, recurse into parsing an expression, end by trying to match a closing paren. (As a closing paren is not recognized as an operator, hence the expression parser will just return when it gets there.)

This way, you can nest parens to any depth.

Implementation

And now: code. This first implementation stays close to the theoretical description. A lot of the following code is not necessary, as it serves to show what is going on.

#!perl -w use strict; my %var; my %op = ( '+' => { prec => 10, assoc => 'L', exec => sub { $_[0] + $_[1] }}, '-' => { prec => 10, assoc => 'L', exec => sub { $_[0] - $_[1] }}, '*' => { prec => 20, assoc => 'L', exec => sub { $_[0] * $_[1] }}, '/' => { prec => 20, assoc => 'L', exec => sub { $_[0] / $_[1] }}, '%' => { prec => 20, assoc => 'L', exec => sub { $_[0] % $_[1] }}, '**' => { prec => 30, assoc => 'R', exec => sub { $_[0] ** $_[1] }}, ); # fields use constant VALUE => 0; use constant OP => 1; sub parse_expr { my @stack; while (1) { trace(); my($value) = parse_value() or die "Parse error at " . where(); trace("value=$value"); if(/\G\s*(\*\*|[+\-*\/%\\])/gc) { # operator my $op = $1; trace("op=$op"); while (@stack and (($op{$op}{prec} < $op{$stack[-1][OP]}{p +rec}) or ($op{$op}{prec} == $op{$stack[-1][OP]}{prec}) and $op{$stack[-1][OP]}{assoc} eq "L")) { my($lhs, $op) = @{pop @stack}; trace("popping $lhs $op"); $value = $op{$op}{exec}->($lhs, $value); trace("result = $value"); } trace("pushing $value $op"); push @stack, [ $value, $op ]; } else { # no more while(@stack) { my($lhs, $op) = @{pop @stack}; trace("popping $lhs $op"); $value = $op{$op}{exec}->($lhs, $value); trace("result = $value"); } trace("returning $value"); return $value; } } } sub parse_value { /\G\s+/gc; if(/\G\+/gc) { # '+' value trace("Unary plus"); return parse_value(); } if(/\G-/gc) { # '-' value trace("Unary minus"); return -parse_value(); } if(/\G((?:\d+\.?\d*|\.\d+)(?i:E[+-]?\d+)?)/gc) { # number return $1; } if(/\G((?i:[a-z]\w*))/gc) { # variable return $var{$1}; } if(/\G\(/gc) { # '(' expr ')' my $value = parse_expr(); /\G\s*/gc; /\G\)/gc or die "Parse error: ')' expected at: " . where() ; return $value; } return; } sub evaluate { local $_ = shift; my $value = parse_expr(); /\G\s+/gc; /\G$/gc or die sprintf "Junk characters at end: \"%s\"", where(); return $value; } sub where { # debugging info my $s = $_; substr($s, pos || 0, 0) = "\267"; return $s; } sub trace { my($pkg, $file, $line) = caller; printf STDERR "Line %d \"%s\" %s\n", $line, where(), @_ ? shift : +""; } %var = ( a => 101, b => 7 ); $_ = "20+3*a+10*-b-5*(3 +2)*5"; $\ = "\n"; print evaluate($_);

As an implementation, it's not ideal: it contains code duplication, the precedence/associativity test is quite verbose, and it's not machine code friendly (low on resources, using as few variables and datastructures as possible), due to the local stacks in the expression parser. That all can be remedied using a few simple tricks.

While processing the data on the stack, you do not have to check the stack depth all the time, as you can use a sentinel value on the stack instead. Just make sure the precedence values of all operators are > 0, and give the sentinel a precedence value of 0, lower than any precedence values for the operators, then every calculation will be performed before parsing ever finishes, while the sentinel just sits safely on the stack. This way, we can even use a global stack, such as the return stack when coding in assembler, in a perfectly safe way.

Associativity can be tackled by pretending a left associative operator on the stack has a higher precedence than the same operator that just got parsed, while a right associative operator has a lower or equal (thus, not higher) precedence on stack than by default, and then the calculation will be postponed. In practice, we just have to make sure the precedence values in the table are even, and increment on the precedence value, just before it gets pushed onto the stack, for a left associative operator.

In the follwoing code, only the sub parse_expr has been changed, but I've posted the whole code to ease copy/run on the program.

#!perl -w use strict; my %var; my %op = ( '+' => { prec => 10, assoc => 'L', exec => sub { $_[0] + $_[1] }}, '-' => { prec => 10, assoc => 'L', exec => sub { $_[0] - $_[1] }}, '*' => { prec => 20, assoc => 'L', exec => sub { $_[0] * $_[1] }}, '/' => { prec => 20, assoc => 'L', exec => sub { $_[0] / $_[1] }}, '%' => { prec => 20, assoc => 'L', exec => sub { $_[0] % $_[1] }}, '**' => { prec => 30, assoc => 'R', exec => sub { $_[0] ** $_[1] }}, ); # fields use constant VALUE => 0; use constant OP => 1; sub parse_expr { my @stack; while (1) { trace(); my($value) = parse_value() or die "Parse error at " . where(); trace("value=$value"); if(/\G\s*(\*\*|[+\-*\/%\\])/gc) { # operator my $op = $1; trace("op=$op"); while (@stack and (($op{$op}{prec} < $op{$stack[-1][OP]}{p +rec}) or ($op{$op}{prec} == $op{$stack[-1][OP]}{prec}) and $op{$stack[-1][OP]}{assoc} eq "L")) { my($lhs, $op) = @{pop @stack}; trace("popping $lhs $op"); $value = $op{$op}{exec}->($lhs, $value); trace("result = $value"); } trace("pushing $value $op"); push @stack, [ $value, $op ]; } else { # no more while(@stack) { my($lhs, $op) = @{pop @stack}; trace("popping $lhs $op"); $value = $op{$op}{exec}->($lhs, $value); trace("result = $value"); } trace("returning $value"); return $value; } } } sub parse_value { /\G\s+/gc; if(/\G\+/gc) { # '+' value trace("Unary plus"); return parse_value(); } if(/\G-/gc) { # '-' value trace("Unary minus"); return -parse_value(); } if(/\G((?:\d+\.?\d*|\.\d+)(?i:E[+-]?\d+)?)/gc) { # number return $1; } if(/\G((?i:[a-z]\w*))/gc) { # variable return $var{$1}; } if(/\G\(/gc) { # '(' expr ')' my $value = parse_expr(); /\G\s*/gc; /\G\)/gc or die "Parse error: ')' expected at: " . where() ; return $value; } return; } sub evaluate { local $_ = shift; my $value = parse_expr(); /\G\s+/gc; /\G$/gc or die sprintf "Junk characters at end: \"%s\"", where(); return $value; } sub where { # debugging info my $s = $_; substr($s, pos || 0, 0) = "\267"; return $s; } sub trace { my($pkg, $file, $line) = caller; printf STDERR "Line %d \"%s\" %s\n", $line, where(), @_ ? shift : +""; } %var = ( a => 101, b => 7 ); $_ = "20+3*a+10*-b-5*(3 +2)*5"; $\ = "\n"; print evaluate($_);

Extending the parser

Extending the parser to recognize new syntax is easy enough. For example, if you have it recognize a name followed by an opening paren, you can treat this as a prefix function call:

value := name '(' ( expr ( ',' expr ) * )? ')'
(You will have to recognize the opening paren before you assume a name is just a variable)

One by one, parse (and calculate) the expressions in the arguments list, putting the results onto a stack, and finally, call the associated function.

If you want to recognize and handle assignments to variables in the form of "a=expr", then you'll have to implement a way to have lvalues. I can think up 2 basic routes from the top of my head. One is to use Perl voodoo, and have a variable return an object that stringifies/nummifies to the variable's current value, but which you can also use to modify its value. The other is to always return a record structure for the values.

Extending the idea of lvalues to pre-/post- -increment/-decrement, you'll probably have to forbid "++" and "--" for any other use, so "--2" is no longer equivalent to 2.

It'd be very nice to recognize advanced syntax structures, such as "1 < x < 2", in a more programmer-friendly rather than computer-friendly way, meaning "x > 1 and x < 2". I think it's doable, as we're comparing precedence of the two "<" operators anyway, but it'll have to do some exceptional processing for that particular operator.

Other special cases, like optional operators, so that "2a" could mean "2*a</>", or "foo bar" meaning "<c>foo AND bar" for a search engine, are possible in principle, at the cost of extra complexity.

Instead of just calculating a value on the fly, the parser can easily be modified to return a parse tree, or produce postfix code instead.

Application and embedding

This is the basic parser, which, because it doesn't have to recognize everything (it just stops at the first thing it doesn't recognize, it's perfectly happy if it just recognized a valid expression as a prefix), can easily be embedded in other parsers. For example, you can write a handcoded parser to recognize basic command statements, control structures, etc. where this (sub)parser handles expressions, which most programmers find the hardest things to tackle, anyway.

License

Like I said, I invented this parsing method myself (some 15 or more years ago), and you're free to use any code, any derived code, or any reimplementations as you see fit. I'd just like some credit if you do so, so please don't pretend you invented it yourself. (Let's say it's under a BSD-style license.) Just, no patenting, or any other such crap.

update: You might want to take a look at my followup where I show a version that returns a parse tree, in the form of Function objects.

Replies are listed 'Best First'.
Re: Operator Precedence Parser
by Rhandom (Curate) on Jun 09, 2006 at 20:29 UTC
    That is sort of funny. What you have described is nearly the same as what I have implemented for the operator parser in CGI::Ex::Template (CGI::Ex::Template is part of the CGI::Ex suite and implements a full fledged TT2 engine and much of the planned TT3 spec).

    To summarize:

    I parse chains of operators and variables into an array - so
    a + b.c - 1.3 * 2 + 3
    would show up as:
    @tree = qw(a + b.c - 1.3 * 2 + 3);
    As I am parsing I also create a hash of found operators and I note their precedence which I get from a table (it is possible to add any amount of other operators to the table). This hash for this example would look like
    %found = ('+' => 85, '-' => 85, '*' => 90)
    I then call a method called "apply_precedence" passing it the tree and the %found hash. Apply precedence takes the highest precedence operator and splits the @tree array into sub trees whenever it finds an operator. Each of those sub trees recursively calls "apply_precedence" until each sub tree only has one element. The returned elements are placed in an execution optree that looks like '+', 'a', 'b.c'. The previous example would parse down to something like (but with a little bit different syntax for encoding the parsed variables, operators and arguments to expressions:
    ['*', ['+', 'a', ['-', 'b.c', 1.3]], ['+', 2, 3]]


    As a side note, a template language that I wrote for my current company starts with this syntax so the parser doesn't have to worry about precedence.

    I would like to know if there is a formal name for this type of parsing. The rest of the parser is recursive in nature, but just sort of flattens out when it reaches an operator.

    The grammar for this parser is hard coded - but is essentially contained with in the parse_variable, parse_args, and apply_precedence methods (parse_variable should more aptly be called parse_expression as it handles variables, numbers, var or num + var or num, parens, {} and () constructs and string literals. I am always interested in seeing somebody write something faster so if somebody would like to hack away - they are certainly welcome to. Apply precedence also has logic to allow for it to correctly split up ternary and nested ternary operators. Writing this also has helped me see that there may be ways to optimize things even more.

    The perldoc of CGI::Ex::Template includes a section on variables and how they are stored. To quote a large section:
    =head1 VARIABLE PARSE TREE CGI::Ex::Template parses templates into an tree of operations. Even variable access is parsed into a tree. This is done in a manner somewhat similar to the way that TT operates except that nested variables such as foo.bar|baz contain the '.' or '|' in between each name level. Operators are parsed and stored as part of the variable ( +it may be more appropriate to say we are parsing a term or an expression) +. The following table shows a variable or expression and the correspondi +ng parsed tree (this is what the parse_variable method would return). one [ 'one', 0 ] one() [ 'one', [] ] one.two [ 'one', 0, '.', 'two', 0 ] one|two [ 'one', 0, '|', 'two', 0 ] one.$two [ 'one', 0, '.', ['two', 0 ], 0 ] one(two) [ 'one', [ ['two', 0] ] ] one.${two().three} [ 'one', 0, '.', ['two', [], '.', 'three', 0], + 0] 2.34 2.34 "one" "one" "one"|length [ \"one", 0, '|', 'length', 0 ] "one $a two" [ \ [ '~', 'one ', ['a', 0], ' two' ], 0 ] [0, 1, 2] [ \ [ 'array', 0, 1, 2 ], 0 ] [0, 1, 2].size [ \ [ 'array', 0, 1, 2 ], 0, '.', 'size', 0 ] ['a', a, $a ] [ \ [ 'array', 'a', ['a', 0], [['a', 0], 0] ], +0] {a => 'b'} [ \ [ 'hash', 'a', 'b' ], 0 ] {a => 'b'}.size [ \ [ 'hash', 'a', 'b' ], 0, '.', 'size', 0 ] {$a => b} [ \ [ 'hash', ['a', 0], ['b', 0] ], 0 ] 1 + 2 [ \ [ '+', 1, 2 ], 0] a + b [ \ [ '+', ['a', 0], ['b', 0] ], 0 ] a * (b + c) [ \ [ '*', ['a', 0], [ \ ['+', ['b', 0], ['c', +0]], 0 ]], 0 ] (a + b) [ \ [ '+', ['a', 0], ['b', 0] ]], 0 ] (a + b) * c [ \ [ '*', [ \ [ '+', ['a', 0], ['b', 0] ], 0 ] +, ['c', 0] ], 0 ] a ? b : c [ \ [ '?', ['a', 0], ['b', 0], ['c', 0] ], 0 ] a || b || c [ \ [ '||', ['a', 0], [ \ [ '||', ['b', 0], ['c +', 0] ], 0 ] ], 0 ] ! a [ \ [ '!', ['a', 0] ], 0 ] Some notes on the parsing. Operators are parsed as part of the variable and become part of th +e variable tree. Operators are stored in the variable tree using a reference to the + arrayref - which allows for quickly descending the parsed variable tree and determi +ning that the next node is an operator. Parenthesis () can be used at any point in an expression to disamb +iguate precedence. "Variables" that appear to be literal strings or literal numbers are returned as the literal (no operator tree). The following perl can be typed at the command line to view the parsed + variable tree: perl -e 'use CGI::Ex::Template; print CGI::Ex::Template::dump_pars +e("foo.bar + 2")."\n"' Also the following can be included in a template to view the output in + a template: [% USE cet = CGI::Ex::Template %] [%~ cet.dump_parse('foo.bar + 2').replace('\s+', ' ') %]


    The following is the long - but useful operators tree (I recently cleaned up the operator table to be a little more clear for version 2.02 which the head version on CPAN). The table is used to generate the regexes necessary for parsing operators.

    $OPERATORS = [ # type precedence symbols action (undef means pl +ay_operator will handle) ['prefix', 98, ['++'], undef + ], ['prefix', 98, ['--'], undef + ], ['postfix', 98, ['++'], undef + ], ['postfix', 98, ['--'], undef + ], ['infix', 96, ['**', 'pow'], sub { $_[0] ** $_[ +1] } ], ['prefix', 93, ['!'], sub { ! $_[0] + } ], ['prefix', 93, ['-'], sub { @_ == 1 ? 0 - $_ +[0] : $_[0] - $_[1] } ], ['infix', 90, ['*'], sub { $_[0] * $_[ +1] } ], ['infix', 90, ['/'], sub { $_[0] / $_[ +1] } ], ['infix', 90, ['div', 'DIV'], sub { int($_[0] / $_[ +1]) } ], ['infix', 90, ['%', 'mod', 'MOD'], sub { $_[0] % $_[ +1] } ], ['infix', 85, ['+'], sub { $_[0] + $_[ +1] } ], ['infix', 85, ['-'], sub { @_ == 1 ? 0 - $_ +[0] : $_[0] - $_[1] } ], ['infix', 85, ['~', '_'], sub { join "", @_ + } ], ['infix', 80, ['<'], sub { $_[0] < $_[ +1] } ], ['infix', 80, ['>'], sub { $_[0] > $_[ +1] } ], ['infix', 80, ['<='], sub { $_[0] <= $_[ +1] } ], ['infix', 80, ['>='], sub { $_[0] >= $_[ +1] } ], ['infix', 80, ['lt'], sub { $_[0] lt $_[ +1] } ], ['infix', 80, ['gt'], sub { $_[0] gt $_[ +1] } ], ['infix', 80, ['le'], sub { $_[0] le $_[ +1] } ], ['infix', 80, ['ge'], sub { $_[0] ge $_[ +1] } ], ['infix', 75, ['==', 'eq'], sub { $_[0] eq $_[ +1] } ], ['infix', 75, ['!=', 'ne'], sub { $_[0] ne $_[ +1] } ], ['infix', 70, ['&&'], undef + ], ['infix', 65, ['||'], undef + ], ['infix', 60, ['..'], sub { $_[0] .. $_[ +1] } ], ['ternary', 55, ['?', ':'], undef + ], ['assign', 53, ['+='], sub { $_[0] + $_[ +1] } ], ['assign', 53, ['-='], sub { $_[0] - $_[ +1] } ], ['assign', 53, ['*='], sub { $_[0] * $_[ +1] } ], ['assign', 53, ['/='], sub { $_[0] / $_[ +1] } ], ['assign', 53, ['%='], sub { $_[0] % $_[ +1] } ], ['assign', 53, ['**='], sub { $_[0]** $_[ +1] } ], ['assign', 53, ['~=', '_='], sub { $_[0] . $_[ +1] } ], ['assign', 52, ['='], undef + ], ['prefix', 50, ['not', 'NOT'], sub { ! $_[0] + } ], ['infix', 45, ['and', 'AND'], undef + ], ['infix', 40, ['or', 'OR'], undef + ], ];
    Again - it is sort of funny to see the same ideas discovered and rediscovered over and over.

    my @a=qw(random brilliant braindead); print $a[rand(@a)];

      What about associativity?

      • Is 6 / 5 * 4 equal to (6 / 5) * 4 (like Perl) or to 6 / (5 * 4)?
      • Is 2 ** 3 ** 4 equal to (2 ** 3) ** 4 or to 2 ** (3 ** 4) (like Perl)?

      Deviances from Perl's behaviour should be documented, IMHO.

      For fun since it doesn't really matter, something else to look at is the interaction between unary minus, post- and pre- -decrement and -increment. For example,

      • a---b
      • a----b
      • a-----b
        Thank you for the reply. I tested to see using the following template.

        <h2>6 / 5 * 4</h2> CET: [% 6 / 5 * 4 %]<br> Perl: [% PERL %]print 6 / 5 * 4 [% END %]<br> <hr> <h2>2 ** 3 ** 4</h2> CET: [% 2 ** 3 ** 4 %]<br> Perl: [% PERL %]print 2 ** 3 ** 4[% END %]<br> <hr> <h2>a---b</h2> CET: [% a = 5; b = 2 %][% a---b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a---$b[% END %]<br> <hr> <h2>a--- -b</h2> CET: [% a = 5; b = 2 %][% a--- -b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a--- -$b[% END %]<br> <hr> <h2>a--- --b</h2> CET: [% a = 5; b = 2 %][% a--- --b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a--- --$b[% END %]<br> <hr>
        It printed out:
        6 / 5 * 4 CET: 0.3 Perl: 4.8 2 ** 3 ** 4 CET: 2.41785163922926e+24 Perl: 2.41785163922926e+24 a---b CET: 3 Perl: 3 a--- -b CET: 7 Perl: 7 a--- --b CET: 4 Perl: 4
        So - it looks like I need to fix my right vs left vs non-associative. I'll add that to the table and change the parser (it is always doing right right now - it used to always do left - it will be trivial and won't even cause a speed hit to allow it to do both). Thank you - I knew about precedence and precedence makes complete sense - associativity rules seem like they are a little more arbitrary and it seems to cry foul to the user that it isn't consistent (such is legacy). In the perl6 operators table they don't even mention if the operator group is right or left (though it probably does elsewhere in the doc).

        So - the infix type will go away in CGI::Ex::Template and will be replaced by left right and non. It needs to match perl.

        I'll get that out hopefully soon. Thanks again.

        Oh - also to note - the a----b and a-----b examples are both parse errors because the longest term rule makes them try and find a prefix -- immediately after the postfix --. Adding a space after the first three - disabiguates both cases.

        my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Operator Precedence Parser
by hv (Prior) on Jun 10, 2006 at 08:51 UTC

    I don't know if I triggered L~R's hunt, but I was asking in the CB about modules for expression parsing and he expressed an interest.

    I was looking for something that would be supplied with a) the string to parse, and b) the parse rules - a set of functions, binary and unary operators, and atoms. Something like this:

    Expr::Parse->new({ function => [ { name => 'p', type => 'bool', args => [ 'int' ], test => sub { _is_prime($_[1]) }, }, { name => 'rev', type => 'int', args => [ 'int' ], test => sub { scalar reverse $_[1] }, }, ], binop => [ { name => '+', type => 'int', args => [ 'int', 'int' ], prec => 4, test => sub { $_[1] + $_[2] }, }, { name => '*', type => 'int', args => [ 'int', 'int' ], prec => 3, test => sub { $_[1] * $_[2] }, }, { name => '=', type => 'bool', args => [ 'int', 'int' ], prec => 6, test => sub { $_[1] == $_[2] }, }, unop => [ { name => '!', type => 'bool', args => [ 'bool' ], prec => 1, test => sub { $_[1] ? 0 : 1 }, }, ], atom => [ { name => 'const', pat => '\d+', type => 'int', test => sub { $_[1] }, }, ], });

    Nobody in the CB could suggest anything at the time, so I wrote my own constructing a Tree::Simple tree as output. I found it surprisingly hard to write, even treating '(', ',' and ')' as builtins (for the function(arg, list) support), and ignoring the type information - I wrote and threw away hundreds of lines of code in at least half a dozen trial implementations before finally coming up with something workable if not particularly pretty.

    If your parser can easily be extended to handle function calls I'll have a go at adapting it for my application.

    Hugo

      If your parser can easily be extended to handle function calls I'll have a go at adapting it for my application.
      I said it would be easy, didn't I? Well I've got to put my money (er, code) where my mouth is, so I adapted it to process function calls. With this global spec, which defines 2 functions:
      my %function = ( sumsq => sub { my $sum = 0; foreach(@_) { $sum += $_*$_; } return $ +sum; }, # sum of squares sqrt => sub { return sqrt shift; }, );
      I added this piece in parse_value(), just in front of the code to handle variables:
      if(/\G((?i:[a-z]\w*))\s*\(/gc) { # function '(' my $function = $1; $function{$function} or die sprintf "Undefined function '$func +tion' called at: \"%s\"", where(); my @arg; unless(/\G\s*(?=\))/gc) { while(1){ my($value) = parse_expr() or die sprintf "Expression e +xpected at: \"%s\"", where(); push @arg, $value; /\G\s*,/gc or last; } } /\G\s+/gc; /\G\)/gc or die sprintf "Parse error: ')' expected at: \"%s\"" +, where(); trace(sprintf "function '$function' called with %d argument%s" +, scalar @arg, @arg==1 ? "" : "s"); return $function{$function}->(@arg); }
      and with the data string "sumsq(3,2+2)*sqrt(36)", the output is:
      Line 29 "·sumsq(3,2+2)*sqrt(36)" Line 29 "sumsq(·3,2+2)*sqrt(36)" Line 31 "sumsq(3·,2+2)*sqrt(36)" value=3 Line 29 "sumsq(3,·2+2)*sqrt(36)" Line 31 "sumsq(3,2·+2)*sqrt(36)" value=2 Line 36 "sumsq(3,2+·2)*sqrt(36)" op=+ Line 29 "sumsq(3,2+·2)*sqrt(36)" Line 31 "sumsq(3,2+2·)*sqrt(36)" value=2 Line 43 "sumsq(3,2+2·)*sqrt(36)" popping 2 + Line 45 "sumsq(3,2+2·)*sqrt(36)" result = 4 Line 85 "sumsq(3,2+2)·*sqrt(36)" function 'sumsq' called with 2 argume +nts Line 31 "sumsq(3,2+2)·*sqrt(36)" value=25 Line 36 "sumsq(3,2+2)*·sqrt(36)" op=* Line 29 "sumsq(3,2+2)*·sqrt(36)" Line 29 "sumsq(3,2+2)*sqrt(·36)" Line 31 "sumsq(3,2+2)*sqrt(36·)" value=36 Line 85 "sumsq(3,2+2)*sqrt(36)·" function 'sqrt' called with 1 argumen +t Line 31 "sumsq(3,2+2)*sqrt(36)·" value=6 Line 43 "sumsq(3,2+2)*sqrt(36)·" popping 25 * Line 45 "sumsq(3,2+2)*sqrt(36)·" result = 150 150 Stack: This value is never affected

      It seems to work fine, was finished in something like 1/4 hour, and it definitely doesn't take hundreds of lines of code. :)

      I wrote my own constructing a Tree::Simple tree as output.
      For completeness sake, I've rewritten the parser so it produces a parse tree. I took a quick look at Tree::Simple, and I found it too hard to my taste for the little benefit it would give me, so I'm using a handrolled function object instead — yes I'm converting the infix operators into prefix function calls. As an extra benefit, I can use overload to return a symbolic representation of the parse tree when used as a string, or actually evaluate it, when used as a number.

      I think it clearly demontrates its viability as a parser for real work.

      update Now with symbolic (postponed) variables, meaning: if you change the value of the variable in the %var hash, the value used in an evaluation by using a parsed expression in a numerical context, will change accordingly.

      Output:

        Very nice code, much simpler & shorter than I expected. I tried to add an = (Perl or C internal-to-expression assignment) to the operators, though, lowest precedence & right-associative:

        '='=> {prec=> 5,assoc=>'R',exec=>sub { print "DEBUG $_[0]=$_[1]\n"; $_[1]; }, function=>'asg'},

        I added it to the operator pattern:

        if (/\G\s*(\*\*|[=+\-*\/%\\])/gc)

        and kind of defined the LHS variable

        %var=(a=>101,b=>7,q=>'q');

        but it fails because Function::evaluate tries to coerce the value, which I set to the variable name, into a number:

        return $code->(map 0+$_,@{$self->{arguments}});

        I think I have a workaround, but it looks like '=' is a special case because the LHS variable should not be evaluated before calling the assignment function.

      CGI::Ex::Template does all of this.

      You can add arbitrary entries to the $OPERATORS table and then use the built in functions to rebuild the global qr's helping in the parsing.

      The parse_expr method takes a reference to a string (which it will consume) and returns a parsed optree.

      The play_expr method takes the optree and actually executes it.

      The parser allows for TT2 style nested variables, function calls, number, double and single quoted strings, arrays and hashes. It also has proper precedence and associativity parsing.

      The entry in the $OPERATORS table should contain the following:
      type: prefix, postfix, left, right, none, ternary, or assign (right +) precedence: the relative precedence value symbols or names: an arrayref of symbols or operators for that oper +ation function: a code ref to run when that operator is found.


      It isn't entirely general as the variable names are TT2'ish - but it certainly could be used in other applications.

      my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Operator Precedence Parser
by ambrus (Abbot) on Jun 10, 2006 at 18:58 UTC

    I have something similar in an interpreter I started to write ages ago in perl, but has abandonned it. The interpreter works but cannot interpret user-defined functions which makes it unsuitable for most purposes. I am now not motivated to continue it.

    You can download the interpreter but I also copy some relevant parts of the code here. You can download the full version and run it with -p to dump the optree instead of interpretting it. (Piping the dump through | sed 's/\bline( /\n&/g' may make it more readable.)

    Another interpreter I wrote in C++ as a school project also has a similar parsing routine.

    Oh well, I don't think anyone would want to read this post this way. I think I'll have to make a simple example parser that works the same way as these two and post it as a meditation if I want any attention.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://554516]
Approved by Limbic~Region
Front-paged by liverpole
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2024-04-19 01:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found