Breaking The Rules II

All,
This meditation has been sitting, unfinished, on my scratch pad for over a year. It was intended to be a companion to Breaking The Rules. Since I doubt I will finish it before another year has past, I am publishing it now. I have made reference to it a number of times over the last year and someone may benefit from it despite the incompleteness. As a result, there may be some grammatical inconsistencies (mostly tense).

I took on a new project of evaluating mathematical expressions where I broke at least 2 rules.

1. Don't reinvent the wheel
2. If you must, use the right tool for the job

My goal is to show what I learned by not following the rules. I fully expect to find that the rules are there for a reason. That is ok since a lesson learned by being bitten is much more likely to stick then one from admonishment.

Originally, I planned just 3 variations on the project. I was going to roll my own, create a new one using Parse::RecDescent, and just test the existing wheels. I am deviating from this plan because I have learned there is a lot more options than I had anticipated and that some are better than others. This list now includes:

Hand rolled
Parse::RecDescent

Manual stratification and left-recursion elimination
Using the leftop/rightop shortcuts
Using right-recursion and reversing the parse-tree

Existing wheels

Explore alternatives to Parse::RecDescent

Explore the Red Dragon Book for new ideas.

While I expect the lessons I learn to mostly be negative ones that re-inforce the case for following the rules, I won't know for sure until I have tried. I initially thought I could guess what I would encounter based off my experience and knowledge as well as what others had warned me about. I was dead wrong.

My first attempt was hand-rolling. Evaluating a mathematical expression only requires a handful of steps:

Breaking the expression up into tokens or atoms (known as lexing)
Process tokens according to hierachial rules (known as precedence)
When tokens have the same precedence, process them according to sequencing rules (known as associativity)

This is just a fancy way of saying that when you evaluate a mathematical expression you do parens before exponents which you do before multiplication and division which comes before addition and subtraction and you do all this from left to right.

While the following parser is not finished or polished, it accomplishes the task I set out to do. It takes a mult-pass approach to overcome the difficulty of precedence and associativity.

package Math::Expression::Evaluator;
@ISA    = qw(Exporter);
@EXPORT = qw(evaluate);
$VERSION = '0.01';

use strict;
use warnings;
use Carp;

my %op_eval = (
    '^' => sub {$_[0] ** $_[1]},
    '+' => sub {$_[0] +  $_[1]},
    '-' => sub {$_[0] -  $_[1]},
    '*' => sub {$_[0] *  $_[1]},
    '/' => sub {$_[0] /  $_[1]},
);
my %func_eval = (
    abs  => sub { abs $_[0] },
    int  => sub { int $_[0] },
    sqrt => sub { sqrt $_[0] },
);

my ($func_re) = map qr{$_}, join '|', keys %func_eval;
my $oper_re   = qr{[()/*^+-]};
my $numb_re   = qr{[+-]?(?:\d+(?:\.\d*)?|\.\d+)};
my $parser    = qr{($func_re|$numb_re|$oper_re)};

sub evaluate {
    my @stack = @_ == 1 ? parse(@_) : @_;
    return $_[0] if @stack == 1;

    0 while fix_op(\@stack);
    0 while reduce_func(\@stack);
    0 while reduce_paren(\@stack);
    for my $op (qw[^ * / + -]) {
        0 while reduce_op($op, \@stack);
    }
    croak "Unable to reduce to a number: '@stack'" if @stack != 1;
    return evaluate($stack[0]);
}

sub parse {
    my $expr = shift @_;
    my @part = $expr =~ /$parser/g;
    parse_error_check($expr, \@part);
    return @part;
}

sub parse_error_check {
    my ($expr, $part) = @_;
    $expr =~ s/$parser//g;
    croak "Unparseable parts: '$expr'" if $expr !~ /^\s*$/;
    croak "Not a number: '$part->[0]'" if @$part == 1 && ! is_num($par
+t->[0]);
}

sub is_num { return $_[0] =~ /$numb_re/; }

sub fix_op {
    my $stack = shift @_;
    for (1 .. $#$stack) {
        my $atom = $stack->[$_];
        next if ! is_num($atom);
        if ($atom =~ s/^([+-])//) {
            my $op = $1;
            next if $stack->[$_ - 1] =~ m{[(*/+^-]};
            splice(@$stack, $_, 1, $op, $atom);            
            return 1;
        }
    }
    return;
}

sub reduce_func {
    my $stack = shift @_;
    for (0 .. $#$stack) {
        my $atom = $stack->[$_];  
        next if ! is_func($atom);
        croak "Function $atom require parens" if $stack->[$_ + 1] ne '
+(';
        reduce_paren($stack, $_ + 1);
        splice(@$stack, $_, 2, calculate($atom, $stack->[$_ + 1]));
        return 1;
    }
}

sub is_func { return exists $func_eval{$_[0]}; }

sub calculate {
    my ($key, $x, $y, $val) = @_;
    eval { $val = is_func($key)
        ? $func_eval{$key}->($x)
        : $op_eval{$key}->($x, $y)
    };
    croak "Error: $@" if $@;
    return $val;
}

sub reduce_paren {
    my ($stack, $start) = @_;
    $start ||= 0;
    my ($beg, $open);
    for ($start .. $#$stack) {
        my $atom = $stack->[$_];
        next if $atom ne '(' && $atom ne ')';
        $open += $atom eq ')' ? -1 : 1;
        $beg = $_ if ! defined $beg && $atom eq '(';
        next if $open;
        my $len = $_ - $beg + 1;
        splice(@$stack, $beg, $len, evaluate(@{$stack}[$beg + 1 .. $_ 
+- 1]));
        return 1;
    }
    croak "Unbalanced Parens" if $open;
}

sub reduce_op {
    my ($op, $stack) = @_;
    return if @$stack < 3;
    for (0 .. $#$stack - 2) {
        my ($prev, $curr, $next) = @{$stack}[$_ .. $_ + 2];
        next if $curr ne $op;
        croak "Error: '$prev $op $next'" if ! is_num($prev) || ! is_nu
+m($next);
        splice(@$stack, $_, 3, calculate($op, $prev, $next));
        return 1;
    }
    return;
}
'This statement is false';
[download]

Next I decided to tackle Parse::RecDescent. Admittedly, I had no experience with it before starting this project. In fact, I purposely waited until after I had rolled my own to try it to ensure it would not influence my solution in any way.

Attempt 1:
My first naive attempt failed because I assumed that by placing one production before another in alternation, that the entire string would be processed finding that production before attempting the next production. This is not how Parse::RecDescent works. It does not move further on in the string until it has tried everything it knows about first. This resulted in precedence not being followed.

Attempt 2:
I discovered Re: Left-associative binary operators in Parse::RecDescent by blokhead which said that one way to solve this problem was to manually introduce stratification. In a nutshell, you start out by looking for the lowest precedence which includes the next highest precedence as part of its production on up to the highest precedence which includes the lowest precedence production. The following is the minimal test case I used trying manual stratification:

#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

my $grammar = q{
    evaluate        : ADD_SUB

    ADD_SUB         : MULT_DIV_MOD ADD_SUB_OP ADD_SUB
                      { [@item[1,2,3]] }
                      | MULT_DIV_MOD

    ADD_SUB_OP      : '+'  |  '-'

    MULT_DIV_MOD    : GROUP MULT_DIV_MOD_OP MULT_DIV_MOD { [@item[1,2,
+3]] }
                      | GROUP

    MULT_DIV_MOD_OP : '*'  |  '/'  |  '%'

    GROUP           : '(' ADD_SUB ')'
                      { $item[2] }
                      | NUMBER

    NUMBER          : INTEGER  |  FLOAT  |  NAN
    INTEGER         : /[+-]?\d+/
    FLOAT           : /([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?/
    NAN             : /(Inf(inity)?|NaN)/i
};
my $parser = new Parse::RecDescent $grammar;
print Dumper $parser->evaluate('42 - 5 + 1');
# Sees the result as 42 - (5 + 1)
[download]

As you can see. It does not adhere to the necessary left associativity.

Attempt 3:
I tried to compensate for this by adding left-recursion. Left recursion is defined when the first part of a production calls itself. Parse::RecDescent does not support left-recursion but eliminating it (once added) is trivial. I found Eliminating Left Recursion in Parse::RecDescent by demerphq and produced the following:

#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

my $grammar = q{
    evaluate        : ADD_SUB

    #ADD_SUB         : MULT_DIV_MOD ADD_SUB_OP ADD_SUB
    #                  { [@item[1,2,3]] }
    #                  | MULT_DIV_MOD

    # Reverse to ADD_SUB : ADD_SUB ADD_SUB_OP MULT_DIV_MOD | MULT_DIV_
+MOD
    # Use following formula to eliminate left recursion
    # A : A x | y -> A : y R, R : x R | e
    # Let A = ADD_SUB
    # Let x = ADD_SUB_OP MULT_DIV_MOD
    # let y = MULT_DIV_MOD

    ADD_SUB         : MULT_DIV_MOD ADD_SUB_TAIL
                      { [@item[1,2]] }
    ADD_SUB_TAIL    : ADD_SUB_OP MULT_DIV_MOD ADD_SUB_TAIL
                      { [@item[1..3]] }
                      |

    ADD_SUB_OP      : '+'  |  '-'

    # Same as above
    MULT_DIV_MOD      : GROUP MULT_DIV_MOD_TAIL
                        { [@item[1,2]] }
    MULT_DIV_MOD_TAIL : MULT_DIV_MOD_OP GROUP MULT_DIV_MOD_TAIL
                        { [@item[1..3]] }
                        |

    MULT_DIV_MOD_OP : '*'  |  '/'  |  '%'

    GROUP           : '(' ADD_SUB ')'
                      { $item[2] }
                      | NUMBER

    NUMBER          : INTEGER  |  FLOAT  |  NAN
    INTEGER         : /[+-]?\d+/
    FLOAT           : /([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?/
    NAN             : /(Inf(inity)?|NaN)/i
};
my $parser = new Parse::RecDescent $grammar;
print Dumper $parser->evaluate('42 - 5 + 1');
[download]

Unfortunately, the parse tree is so complex, I would need to write a parse tree parser to be able to evaluate the expression. While I did try very hard to make this work, I ultimately abandoned it.

Attempt 4:
I found Re: Order of Precedence in Parse::RecDescent grammar by ikegami which claimed to handle precedence and associativity. Unfortunately, it had a few bugs that I worked with ikegami to correct. Additionally, it only parsed - it did not evaluate. After several exchanges with ikegami, lots of experimentation, and a bit of wall-head exchanges - I finally got the following to evaluate as it parsed:

#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

my %dispatch = (
    '+'    => sub { $_[0] +  $_[1] },
    '-'    => sub { $_[0] -  $_[1] },
    '*'    => sub { $_[0] *  $_[1] },
    '/'    => sub { $_[0] /  $_[1] },
    '^'    => sub { $_[0] ** $_[1] },
    'abs'  => sub { abs  $_[0] },
    'sqrt' => sub { sqrt $_[0] },
);

sub calculate {
    my $rule = shift @_;

    if ($rule eq 'FUNCTION') {
        my ($func, $x) = @_;
        my $val = eval { $dispatch{$func}->($x); };
        die $@ if $@;
        return $val;
    }

    my @atom = @{ shift @_ };
    my $val = shift @atom;
    while (@atom) {
        my ($op, $num) = splice(@atom, 0, 2);
        eval { $val = $dispatch{$op}->($val, $num); };
        die $@ if $@;
    }
    return $val;
}

my $grammar = <<'__GRAMMAR__';

evaluate        : EXPR /\Z/                                        { $
+item[1] }
EXPR            : ADD_SUB                                          { $
+item[1] }

ADD_SUB         :  <leftop: MUL_DIV_MOD  ADD_SUB_OP  MUL_DIV_MOD>  { m
+ain::calculate( @item ) }
ADD_SUB_OP      :  '+'  |  '-'

MUL_DIV_MOD     : <leftop: POW  MUL_DIV_MOD_OP  POW>               { m
+ain::calculate( @item ) }
MUL_DIV_MOD_OP  :  '*'  |  '/'  |  '%'

POW             :  <leftop: FUNCTION POW_OP FUNCTION>              { m
+ain::calculate( @item ) }
POW_OP          :  '^'

FUNCTION        : FUNC_NAME GROUP                                  { m
+ain::calculate( @item ) }
                  | GROUP

FUNC_NAME       : 'abs'  |  'sqrt'

GROUP           :  '('  EXPR  ')'                                  { $
+item[2] }
                   |  NUMBER

NUMBER          : FLOAT | INTEGER | NAN
INTEGER         : /[+-]?\d+/
FLOAT           : /([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?/
NAN             : /(Inf(inity)?|NaN)/i

__GRAMMAR__
my $parser = Parse::RecDescent->new($grammar) or die("Bad grammar\n");
my $answer = $parser->evaluate('11 - (4 + 4)^3 * sqrt(5 * (6 - 1)) + a
+bs(-3)');
print defined $answer ? $answer : 'Invalid expression';
[download]

Further Attempts:
While working with ikegami, he pointed out that it was possible to create a sane parse tree and save the evaluation until the end. While having a sane parse tree as a side-effect was not something I set out to do, I figured it worth investigating. I condensed his example to the following:

#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

my %eval = (
   disp => \&eval_dispatch,
   term => sub { $_[0] ->  [1] },
   '+'  => sub { $_[0] + $_[1] },
   '-'  => sub { $_[0] - $_[1] },
   '*'  => sub { $_[0] * $_[1] },
   '/'  => sub { $_[0] / $_[1] },
   '%'  => sub { $_[0] % $_[1] },
);

sub eval_node {
   local *_ = \$_[0];
   return $eval{disp}->($_->[0], $_);
}

sub eval_dispatch {
    my ($op, $node) = @_;
    return $eval{$op}->($node) if $op eq 'term';
    my $x = eval_node($node->[1]);
    my $y = eval_node($node->[2]);
    return $eval{$op}->($x, $y);
}

sub treeify {
    my $t = shift @_;
    $t = [ shift @_, $t, shift @_ ] while @_;
    return $t;
}

my $grammar = <<'__END_OF_GRAMMAR__';
   build   : expr /\Z/ { $item[1] }

   # Just an alias
   expr    : sum

   # vvv lowest precedence
   sum     : <leftop: prod SUM  prod> { main::treeify(@{$item[1]}) }
   prod    : <leftop: term PROD term> { main::treeify(@{$item[1]}) }
   # ^^^ highest precedence

   term    : '(' <commit> expr ')' { $item[3] }
           | UNSIGN_INT { [ @item ] }

   # Tokens
   UNSIGN_INT   : /\d+/

   SUM          : '+' | '-'
   PROD         : '*' | '/' | '%'
__END_OF_GRAMMAR__

my $parser = Parse::RecDescent->new($grammar) or die("Bad grammar\n");
my $tree = $parser->build('11 - 6 + 4');
my $eval = eval_node($tree);
print "$eval\n";
[download]

I believe I could adapt this technique to evaluate the unruly parse tree that I abandoned earlier but do not feel it is worth the effort. We (ikegami and I) also discussed returning blessed nodes which would greatly simplify the evaluation of the parse tree. Enough abstraction might make using Parse::RecDescent for this task tolerable. I can't help thinking that a different parsing method would be straight forward and pleasant without all the distractions.

I am not sure I am even going to attempt right-recursion and reversing the parse tree. This is a technique tye and I discussed in the chatterbox but my gut is telling me that Parse::RecDescent requires too much submission to be considered the right tool for this job. This contradicts what others I perceived to be more knowledgeable than myself told me. This is not to say that I am not completely blown away by the power and versatility of Parse::RecDescent. In fact, ikegami was inspired to write Operator Associativity and Eliminating Left-Recursion in Parse::RecDescent as a result of all my questions. It would have been nice if it existed before I dug in.

Next I tried Parse::Yapp. I had read that Parse::RecDescent and Parse::Yapp suffer from different sides of the same problem (right-recursive vs left-recursive). I do not believe this to be the case at all. Since all my math operators are left associative, except exponents, Parse::Yapp should make my task easier. This is an understatement! Parse::Yapp has precedence AND associativity built in. This code could obviously be abstracted further but it demonstrates how simple it is:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Parse::Yapp;

my $grammar = join '', <DATA>;
my $parser  = Parse::Yapp->new(input => $grammar);
my $yapptxt = $parser->Output(classname => 'Calc');

eval $yapptxt; # normally written to a file

my $calc = Calc->new();
$calc->Ingest("11 - (4 + 4)^3 * sqrt(5 * (6 - 1)) + abs(-3)\n");
my $output = $calc->YYParse(yylex => \&Calc::Lexer);
print $output;

__DATA__
%left     '-' '+'
%left     '*' '/' '%'
%right    '^'
%nonassoc 'sqrt' 'abs'

%%
stack :  | stack expr '\n' { push @{$_[1]}, $_[2]; $_[1][0] };
expr  :    add | del | mul | div | mod | pow | grp | sqrt | abs | NUM;

add   : expr '+' expr { $_[1] +  $_[3] };
del   : expr '-' expr { $_[1] -  $_[3] };
mul   : expr '*' expr { $_[1] *  $_[3] };
div   : expr '/' expr { $_[1] /  $_[3] };
mod   : expr '%' expr { $_[1] %  $_[3] };
pow   : expr '^' expr { $_[1] ** $_[3] };
grp   :  '(' expr ')' {      $_[2]     };
abs   : 'abs' grp     {   abs($_[2])   };
sqrt  : 'sqrt' grp    {   sqrt($_[2])  };
%%
sub Lexer {
    my $parser = shift @_;
    local *_ = \$parser->YYData->{INPUT};

    s/^[ \t]+//;             # leading non-newline whitespace

    if (s/^(([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?)//) {
        return ('NUM', $1);  # borrowed from Scalar::Util
    }
    return ($1, $1) if s/^(sqrt|abs)//;
    return ($1, $1) if s/^(.)//s;
}

sub Ingest {
    my $self = shift @_;
    $self->YYData->{INPUT} = $_[0];
}
[download]

While I didn't experiment exhaustively, it appears that it doesn't matter what order I put anything in - the precedence and associativity declaration at the top makes it "just work". Parse::Yapp is not all good. You have to write your own lexer routine. Additionally, the documentation is not nearly as comprehensive as Parse::RecDescent. After discussing writing your own lexer in the CB with Corion, I am not as convinced having an auto-generated lexer is necessary. Since this is a context free grammar, the only problems you might run into is ordering things such that tokens that contain other tokens as a substring come first. I was going to see if perl::byacc might buy me an auto-generated lexer but I am not sure it is necessary now.

I tried to get into HOP::Parser next but the documentation refers you to 40 pages from the book. I own the book and have even skim read most of it to include chapter 8. I contacted Ovid to try and get a simple example showing precedence and associtivity that I could expand. He is interested but also busy. I think I am going to shelf HOP::Parser until I can either find the time to read the book in-depth or Ovid finds the time to provide an example.

Parse::Earley is Luke Palmer's implementation of Jay Earley's efficient context-free parsing algorithm (top-down). Even though it does not provide built-in precedence like Parse::Yapp (bottom-up), it handles right and left associativity without resorting to tricks. Additionally, it produces all possible parse trees when more than one are possible. This may or may not be a good thing if you are only after the "best".

While I came short of actually evaluating the expression, I am quite impressed with what Luke was able to put together in 1 week (literally). I did have some hangups with the POD and figuring out how to just tell the parser to consume all input as well as a few other things. I ended up abandoning it as there is no support for actions to either build your own tree or evaluate on the fly. The resulting tree may be easy to walk for evaluation but it seemed over my head.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Parse::Earley;

my $parser = Parse::Earley->new();
my $grammar = <<'__GRAMMAR__';
input:   expr

expr:    expr '+' mul_term 
       | expr '-' mul_term
       | mul_term

mul_term:   mul_term '*' exp 
          | mul_term '/' exp
          | exp

exp:  term '^' exp
      | term

term:  '(' expr ')'
       | /\d+/ { $_ < 256 }
__GRAMMAR__
my $str = '1 + 2 - 3';
$parser->grammar($grammar);
$parser->start('input');
$parser->advance($str) for 1..6;
my ($tree) = $parser->matches_all($str, 'input');
print Dumper($tree);
[download]

Here is where I will talk about the other parsers, theory of the dragon book, and my lessons learned. If I don't run out of steam by now I want to see if I can write my own config language. The purpose of which would be to describe your operators functions precedence levels etc. The first pass would be to parse this config language and constructs the grammar to parse the language itself. This way, adding operators or precedence levels would only involve updating a config file.

Here is where I will draw my conclusions.

Ok, so I didn't actually finish the project but I did draw some conclusions. In no particular order:

People giving advice don't always listen:
When I indicated I was going to be writing a mathematical expression evaluator, I was told quite emphatically that Parse::RecDescent was the right tool for the job by several monks I respect. No warnings were given that handling precedence and associativity was tricky. Additionally, they seemed to be focused on the parsing part and not on the evaluation part. Creating a parse tree is one thing, using it is quite a different thing. I am quite convinced that Parse::RecDescent was the wrong tool for the job.

People giving advice are not always well informed:
I was told that Parse::RecDescent and Parse::Yapp were pretty much equal despite tackling the same problem in different ways (right-recursion vs left-recursion). From my limited experience, this is just plain wrong. If you need to handle precedence and/or associativity, Parse::Yapp is far superior.

Existing wheels don't always fit your car:
I did not try all the existing wheels I listed above but the ones I did try were clumsy and cumbersome. My handrolled version did exactly what I wanted. It took me less than an hour to write while the others took much longer.

You can learn a lot from breaking the rules:

Readmore tags added per jdporter's good suggestion.

Cheers - L~R

Back to Meditations