Re: An adventure with 5.10 regular expressions

I finally got back to this thread!

A couple of quick comments first:

I wanted to use recursion [...] The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN).

That's actually a very poor choice. The advantage of RPN is that it can be parsed with neither backtracking nor recursion. Parsing RPN can be simplified to checking the size of a stack.

It occurred to me later that I could eliminate recursion entirely,

What should have occurred to you is that you had already eliminated recursion entirely.

I feel you left the tracks you intended to visit. You did so when you replaced the recursion with stack checks (e.g. (?(?{ @stack < 1 }) (*FAIL) )). At that point, you stopped using the regexp engine as a parser since you moved all but the tokenizing back into Perl. Since the intent was to develop a parser as a regexp pattern, let's revisit the recursive method.

First, let's go back to the EBNF grammar.

expr      := number
           | expr ws unary_op 
           | expr ws expr ws binary_op
number    := digit {digit} [ '.' {digit} ]
           | '.' digit {digit}
digit     := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
unary_op  := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan"
binary_op := '+' | '-' | '*' | '/' | '^'
ws        := ' ' | '\t' | '\r' | '\n'
[download]

If you tried to build a regexp from that, you'd get

Infinite recursion in regex
[download]

When something other than a number is found while trying to match an expr, it tries to match an expr.
It finds something other than a number, so it tries to match an expr.
It finds something other than a number, so it tries to match an expr.
...

This type of parser is known as a LL parser. Left recursion ("rule : rule ..." or something that reduces to that) must be eliminated from the rules of a LL parser. This is to what I was alluding in my earlier post.

expr      := number
           | expr ws unary_op 
           | expr ws expr ws binary_op
[download]

becomes

expr      := number expr_
expr_     := ws expr ws binary_op expr_
           | ws unary_op expr_
           |
[download]

At this point, building the regexp is straightforward:

qr{
   (?(DEFINE)
      (?<expr>
         (?&number) (?&expr_)
      )

      (?<expr_>
         (?:
            \s++
            (?:
               (?&expr) \s++ (?&binary_op) (?&expr_)
            |  
               (?&unary_op) (?&expr_)
            )
         )?
      )

      (?<number>
         ( (?&NUMBER) )
         (?{
            local @stack = @stack;
            push @stack, $^N;
         })
      )

      (?<unary_op>
         ( (?&IDENT) )
         (?(?{ !exists($unary_ops{$^N}) }) (*FAIL) )
         (?{
            local @stack = @stack;
            $unary_ops{$^N}->($stack[-1]);
         })
      )

      (?<binary_op>
         ( [-+/*^] )
         (?{
            local @stack = @stack;
            $binary_ops{$^N}->($stack[-2], pop(@stack));
         })
      )


      # Tokens.
      # Backtracking can't cause a token
      # rule to return something different.

      (?<NUMBER> \d++ (?: [.] \d*+ )?+ | [.] \d++ )
      (?<IDENT>  [_A-Za-z][_A-Za-z0-9]*+          )

   )

   (?{ local @stack = (); })
   \A \s* (?&expr) \s* \z
   (?{ $result = pop(@stack) })
}sx;
[download]

Notice the complete lack of stack size check? The grammar enforces the right number of arguments, preventing us from having an invalid stack.

By the way, getting an identifier and checking if it's one of the operators is a much more robust then just looking for the operator names. It also results in better error messages. It isn't strictly needed here, but this is an educational venture, right?

I made some small changes to the Perl interface, so the harness needs to be changed slightly:

#!/usr/bin/perl

use 5.010_000;  # re features

use warnings;
use strict;

our @stack;
our $result;

my %unary_ops = (
    chs  => sub { $_[0] = -$_[0] },
    abs  => sub { $_[0] = abs $_[0] },
    sqr  => sub { $_[0] *= $_[0] },
    sqrt => sub { $_[0] = sqrt $_[0] },
    sin  => sub { $_[0] = sin $_[0] },
    cos  => sub { $_[0] = cos $_[0] },
    tan  => sub { $_[0] = sin $_[0] / cos $_[0] },
);

my %binary_ops = (
    '+' => sub { $_[0] += $_[1] },
    '-' => sub { $_[0] -= $_[1] },
    '/' => sub { $_[0] /= $_[1] },
    '*' => sub { $_[0] *= $_[1] },
    '^'  => sub { $_[0]**= $_[1] },
);

my $RPN = qr{
   ...
};

my %input = (
   ...
);

foreach my $line ( keys %input ) {
    local $result;
    my ($expression) = $line =~ /($RPN)/;
    ...
}
[download]

Results:

OK: .456 = .456
OK: 123. = 123.
OK: 4 2 + = 6
OK: 123 / is invalid, got no result
OK: 24 abs = 24
OK: 1 24 abs + sqrt = 5
OK: 12 34 is invalid, got no result
OK: 123 34 * chs = -4182
OK: 4 2 - = 2
OK: 34 2 / 2 ^ = 289
OK: 24 4 + chs sqr = 784
OK: 123.456 + is invalid, got no result
OK: 123 chs 34 * = -4182
OK: fail is invalid, got no result
OK: 24 chs abs = 24
OK: 25 abs sqrt = 5
OK: 12 34+ is invalid, got no result
OK: 12 34 + + is invalid, got no result
OK: 123 chs chs chs = -123
OK: 1 24abs+sqrt is invalid, got no result
OK: 12 34 + 56 + = 102
OK: 123 chs = -123
OK: 24 = 24
OK: 1 fail is invalid, got no result
OK: 12 34 56 + + = 102
OK: 4 2 ^ = 16
OK: 123 chs chs chs chs = 123
OK: 3 34 2 / 2 ^ * = 867
OK: 25 chs abs sqrt = 5
OK: 123.456 = 123.456
OK:  123  = 123
OK: 123 chs chs = 123
OK: 24 chs = -24
OK: 123 = 123
OK: 24abs is invalid, got no result
OK: 4 2 / = 2
OK: 24abssqrt is invalid, got no result
OK: 4 2 * = 8
[download]

Comment on Re: An adventure with 5.10 regular expressions Select or Download Code