Re: Operator Precedence Parser

That is sort of funny. What you have described is nearly the same as what I have implemented for the operator parser in CGI::Ex::Template (CGI::Ex::Template is part of the CGI::Ex suite and implements a full fledged TT2 engine and much of the planned TT3 spec).

To summarize:

I parse chains of operators and variables into an array - so

  a + b.c - 1.3 * 2 + 3
[download]

would show up as:

  @tree = qw(a + b.c - 1.3 * 2 + 3);
[download]

As I am parsing I also create a hash of found operators and I note their precedence which I get from a table (it is possible to add any amount of other operators to the table). This hash for this example would look like

  %found = ('+' => 85, '-' => 85, '*' => 90)
[download]

I then call a method called "apply_precedence" passing it the tree and the %found hash. Apply precedence takes the highest precedence operator and splits the @tree array into sub trees whenever it finds an operator. Each of those sub trees recursively calls "apply_precedence" until each sub tree only has one element. The returned elements are placed in an execution optree that looks like '+', 'a', 'b.c'. The previous example would parse down to something like (but with a little bit different syntax for encoding the parsed variables, operators and arguments to expressions:

  ['*', ['+', 'a', ['-', 'b.c', 1.3]], ['+', 2, 3]]
[download]

As a side note, a template language that I wrote for my current company starts with this syntax so the parser doesn't have to worry about precedence.

I would like to know if there is a formal name for this type of parsing. The rest of the parser is recursive in nature, but just sort of flattens out when it reaches an operator.

The grammar for this parser is hard coded - but is essentially contained with in the parse_variable, parse_args, and apply_precedence methods (parse_variable should more aptly be called parse_expression as it handles variables, numbers, var or num + var or num, parens, {} and () constructs and string literals. I am always interested in seeing somebody write something faster so if somebody would like to hack away - they are certainly welcome to. Apply precedence also has logic to allow for it to correctly split up ternary and nested ternary operators. Writing this also has helped me see that there may be ways to optimize things even more.

The perldoc of CGI::Ex::Template includes a section on variables and how they are stored. To quote a large section:

=head1 VARIABLE PARSE TREE

CGI::Ex::Template parses templates into an tree of operations.  Even
variable access is parsed into a tree.  This is done in a manner
somewhat similar to the way that TT operates except that nested
variables such as foo.bar|baz contain the '.' or '|' in between each
name level.  Operators are parsed and stored as part of the variable (
+it
may be more appropriate to say we are parsing a term or an expression)
+.

The following table shows a variable or expression and the correspondi
+ng parsed tree
(this is what the parse_variable method would return).

    one                [ 'one',  0 ]
    one()              [ 'one',  [] ]
    one.two            [ 'one',  0, '.', 'two',  0 ]
    one|two            [ 'one',  0, '|', 'two',  0 ]
    one.$two           [ 'one',  0, '.', ['two', 0 ], 0 ]
    one(two)           [ 'one',  [ ['two', 0] ] ]
    one.${two().three} [ 'one',  0, '.', ['two', [], '.', 'three', 0],
+ 0]
    2.34               2.34
    "one"              "one"
    "one"|length       [ \"one", 0, '|', 'length', 0 ]
    "one $a two"       [ \ [ '~', 'one ', ['a', 0], ' two' ], 0 ]
    [0, 1, 2]          [ \ [ 'array', 0, 1, 2 ], 0 ]
    [0, 1, 2].size     [ \ [ 'array', 0, 1, 2 ], 0, '.', 'size', 0 ]
    ['a', a, $a ]      [ \ [ 'array', 'a', ['a', 0], [['a', 0], 0] ], 
+0]
    {a  => 'b'}        [ \ [ 'hash',  'a', 'b' ], 0 ]
    {a  => 'b'}.size   [ \ [ 'hash',  'a', 'b' ], 0, '.', 'size', 0 ]
    {$a => b}          [ \ [ 'hash',  ['a', 0], ['b', 0] ], 0 ]
    1 + 2              [ \ [ '+', 1, 2 ], 0]
    a + b              [ \ [ '+', ['a', 0], ['b', 0] ], 0 ]
    a * (b + c)        [ \ [ '*', ['a', 0], [ \ ['+', ['b', 0], ['c', 
+0]], 0 ]], 0 ]
    (a + b)            [ \ [ '+', ['a', 0], ['b', 0] ]], 0 ]
    (a + b) * c        [ \ [ '*', [ \ [ '+', ['a', 0], ['b', 0] ], 0 ]
+, ['c', 0] ], 0 ]
    a ? b : c          [ \ [ '?', ['a', 0], ['b', 0], ['c', 0] ], 0 ]
    a || b || c        [ \ [ '||', ['a', 0], [ \ [ '||', ['b', 0], ['c
+', 0] ], 0 ] ], 0 ]
    ! a                [ \ [ '!', ['a', 0] ], 0 ]

Some notes on the parsing.

    Operators are parsed as part of the variable and become part of th
+e variable tree.

    Operators are stored in the variable tree using a reference to the
+ arrayref - which
    allows for quickly descending the parsed variable tree and determi
+ning that the next
    node is an operator.

    Parenthesis () can be used at any point in an expression to disamb
+iguate precedence.

    "Variables" that appear to be literal strings or literal numbers
    are returned as the literal (no operator tree).

The following perl can be typed at the command line to view the parsed
+ variable tree:

    perl -e 'use CGI::Ex::Template; print CGI::Ex::Template::dump_pars
+e("foo.bar + 2")."\n"'

Also the following can be included in a template to view the output in
+ a template:

    [% USE cet = CGI::Ex::Template %]
    [%~ cet.dump_parse('foo.bar + 2').replace('\s+', ' ') %]
[download]

The following is the long - but useful operators tree (I recently cleaned up the operator table to be a little more clear for version 2.02 which the head version on CPAN). The table is used to generate the regexes necessary for parsing operators.

$OPERATORS = [
    # type      precedence symbols              action (undef means pl
+ay_operator will handle)
    ['prefix',  98,        ['++'],              undef                 
+                      ],
    ['prefix',  98,        ['--'],              undef                 
+                      ],
    ['postfix', 98,        ['++'],              undef                 
+                      ],
    ['postfix', 98,        ['--'],              undef                 
+                      ],
    ['infix',   96,        ['**', 'pow'],       sub {     $_[0] ** $_[
+1]                  } ],
    ['prefix',  93,        ['!'],               sub {   ! $_[0]       
+                    } ],
    ['prefix',  93,        ['-'],               sub { @_ == 1 ? 0 - $_
+[0] : $_[0] - $_[1] } ],
    ['infix',   90,        ['*'],               sub {     $_[0] *  $_[
+1]                  } ],
    ['infix',   90,        ['/'],               sub {     $_[0] /  $_[
+1]                  } ],
    ['infix',   90,        ['div', 'DIV'],      sub { int($_[0] /  $_[
+1])                 } ],
    ['infix',   90,        ['%', 'mod', 'MOD'], sub {     $_[0] %  $_[
+1]                  } ],
    ['infix',   85,        ['+'],               sub {     $_[0] +  $_[
+1]                  } ],
    ['infix',   85,        ['-'],               sub { @_ == 1 ? 0 - $_
+[0] : $_[0] - $_[1] } ],
    ['infix',   85,        ['~', '_'],          sub { join "", @_     
+                    } ],
    ['infix',   80,        ['<'],               sub {     $_[0] <  $_[
+1]                  } ],
    ['infix',   80,        ['>'],               sub {     $_[0] >  $_[
+1]                  } ],
    ['infix',   80,        ['<='],              sub {     $_[0] <= $_[
+1]                  } ],
    ['infix',   80,        ['>='],              sub {     $_[0] >= $_[
+1]                  } ],
    ['infix',   80,        ['lt'],              sub {     $_[0] lt $_[
+1]                  } ],
    ['infix',   80,        ['gt'],              sub {     $_[0] gt $_[
+1]                  } ],
    ['infix',   80,        ['le'],              sub {     $_[0] le $_[
+1]                  } ],
    ['infix',   80,        ['ge'],              sub {     $_[0] ge $_[
+1]                  } ],
    ['infix',   75,        ['==', 'eq'],        sub {     $_[0] eq $_[
+1]                  } ],
    ['infix',   75,        ['!=', 'ne'],        sub {     $_[0] ne $_[
+1]                  } ],
    ['infix',   70,        ['&&'],              undef                 
+                      ],
    ['infix',   65,        ['||'],              undef                 
+                      ],
    ['infix',   60,        ['..'],              sub {     $_[0] .. $_[
+1]                  } ],
    ['ternary', 55,        ['?', ':'],          undef                 
+                      ],
    ['assign',  53,        ['+='],              sub {     $_[0] +  $_[
+1]                  } ],
    ['assign',  53,        ['-='],              sub {     $_[0] -  $_[
+1]                  } ],
    ['assign',  53,        ['*='],              sub {     $_[0] *  $_[
+1]                  } ],
    ['assign',  53,        ['/='],              sub {     $_[0] /  $_[
+1]                  } ],
    ['assign',  53,        ['%='],              sub {     $_[0] %  $_[
+1]                  } ],
    ['assign',  53,        ['**='],             sub {     $_[0]**  $_[
+1]                  } ],
    ['assign',  53,        ['~=', '_='],        sub {     $_[0] .  $_[
+1]                  } ],
    ['assign',  52,        ['='],               undef                 
+                      ],
    ['prefix',  50,        ['not', 'NOT'],      sub {   ! $_[0]       
+                    } ],
    ['infix',   45,        ['and', 'AND'],      undef                 
+                      ],
    ['infix',   40,        ['or', 'OR'],        undef                 
+                      ],
];
[download]

Again - it is sort of funny to see the same ideas discovered and rediscovered over and over.

my @a=qw(random brilliant braindead); print $a[rand(@a)];

Comment on Re: Operator Precedence Parser Select or Download Code

Replies are listed 'Best First'.
Re^2: Operator Precedence Parser by ikegami (Patriarch) on Jun 09, 2006 at 21:14 UTC
What about associativity? Is `6 / 5 * 4` equal to `(6 / 5) * 4` (like Perl) or to `6 / (5 * 4)`? Is `2 3 4` equal to `(2 3) 4` or to `2 (3 4)` (like Perl)? Deviances from Perl's behaviour should be documented, IMHO. For fun since it doesn't really matter, something else to look at is the interaction between unary minus, post- and pre- -decrement and -increment. For example, `a---b` `a----b` `a-----b`	[reply] [d/l] [select]
Re^3: Operator Precedence Parser by Rhandom (Curate) on Jun 09, 2006 at 22:27 UTC
Thank you for the reply. I tested to see using the following template. <h2>6 / 5 * 4</h2> CET: [% 6 / 5 * 4 %]<br> Perl: [% PERL %]print 6 / 5 * 4 [% END %]<br> <hr> <h2>2 3 4</h2> CET: [% 2 3 4 %]<br> Perl: [% PERL %]print 2 3 4[% END %]<br> <hr> <h2>a---b</h2> CET: [% a = 5; b = 2 %][% a---b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a---$b[% END %]<br> <hr> <h2>a--- -b</h2> CET: [% a = 5; b = 2 %][% a--- -b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a--- -$b[% END %]<br> <hr> <h2>a--- --b</h2> CET: [% a = 5; b = 2 %][% a--- --b %]<br> Perl: [% PERL %]$a = 5; $b = 2; print $a--- --$b[% END %]<br> <hr> [download] It printed out: `6 / 5 * 4 CET: 0.3 Perl: 4.8 2 3 4 CET: 2.41785163922926e+24 Perl: 2.41785163922926e+24 a---b CET: 3 Perl: 3 a--- -b CET: 7 Perl: 7 a--- --b CET: 4 Perl: 4` [download] So - it looks like I need to fix my right vs left vs non-associative. I'll add that to the table and change the parser (it is always doing right right now - it used to always do left - it will be trivial and won't even cause a speed hit to allow it to do both). Thank you - I knew about precedence and precedence makes complete sense - associativity rules seem like they are a little more arbitrary and it seems to cry foul to the user that it isn't consistent (such is legacy). In the perl6 operators table they don't even mention if the operator group is right or left (though it probably does elsewhere in the doc). So - the infix type will go away in CGI::Ex::Template and will be replaced by left right and non. It needs to match perl. I'll get that out hopefully soon. Thanks again. Oh - also to note - the a----b and a-----b examples are both parse errors because the longest term rule makes them try and find a prefix -- immediately after the postfix --. Adding a space after the first three - disabiguates both cases. my @a=qw(random brilliant braindead); print $a[rand(@a)];	[reply] [d/l] [select]
Re^4: Operator Precedence Parser by Rhandom (Curate) on Jun 12, 2006 at 12:52 UTC
Well I had posted late-late saturday night that the fix was out in version 2.03 - but it was late enough I forgot to commit the post instead of just preview. So - version 2.03 has the fixes. And as a bonus it helped to make the code more readable too. my @a=qw(random brilliant braindead); print $a[rand(@a)];	[reply]