in reply to An adventure with 5.10 regular expressions

I finally got back to this thread!


A couple of quick comments first:

I wanted to use recursion [...] The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN).

That's actually a very poor choice. The advantage of RPN is that it can be parsed with neither backtracking nor recursion. Parsing RPN can be simplified to checking the size of a stack.

It occurred to me later that I could eliminate recursion entirely,

What should have occurred to you is that you had already eliminated recursion entirely.


I feel you left the tracks you intended to visit. You did so when you replaced the recursion with stack checks (e.g. (?(?{ @stack < 1 }) (*FAIL) )). At that point, you stopped using the regexp engine as a parser since you moved all but the tokenizing back into Perl. Since the intent was to develop a parser as a regexp pattern, let's revisit the recursive method.

First, let's go back to the EBNF grammar.

expr := number | expr ws unary_op | expr ws expr ws binary_op number := digit {digit} [ '.' {digit} ] | '.' digit {digit} digit := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' unary_op := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan" binary_op := '+' | '-' | '*' | '/' | '^' ws := ' ' | '\t' | '\r' | '\n'

If you tried to build a regexp from that, you'd get

Infinite recursion in regex

When something other than a number is found while trying to match an expr, it tries to match an expr.
It finds something other than a number, so it tries to match an expr.
It finds something other than a number, so it tries to match an expr.
...

This type of parser is known as a LL parser. Left recursion ("rule : rule ..." or something that reduces to that) must be eliminated from the rules of a LL parser. This is to what I was alluding in my earlier post.

expr := number | expr ws unary_op | expr ws expr ws binary_op

becomes

expr := number expr_ expr_ := ws expr ws binary_op expr_ | ws unary_op expr_ |

At this point, building the regexp is straightforward:

qr{ (?(DEFINE) (?<expr> (?&number) (?&expr_) ) (?<expr_> (?: \s++ (?: (?&expr) \s++ (?&binary_op) (?&expr_) | (?&unary_op) (?&expr_) ) )? ) (?<number> ( (?&NUMBER) ) (?{ local @stack = @stack; push @stack, $^N; }) ) (?<unary_op> ( (?&IDENT) ) (?(?{ !exists($unary_ops{$^N}) }) (*FAIL) ) (?{ local @stack = @stack; $unary_ops{$^N}->($stack[-1]); }) ) (?<binary_op> ( [-+/*^] ) (?{ local @stack = @stack; $binary_ops{$^N}->($stack[-2], pop(@stack)); }) ) # Tokens. # Backtracking can't cause a token # rule to return something different. (?<NUMBER> \d++ (?: [.] \d*+ )?+ | [.] \d++ ) (?<IDENT> [_A-Za-z][_A-Za-z0-9]*+ ) ) (?{ local @stack = (); }) \A \s* (?&expr) \s* \z (?{ $result = pop(@stack) }) }sx;

Notice the complete lack of stack size check? The grammar enforces the right number of arguments, preventing us from having an invalid stack.

By the way, getting an identifier and checking if it's one of the operators is a much more robust then just looking for the operator names. It also results in better error messages. It isn't strictly needed here, but this is an educational venture, right?

I made some small changes to the Perl interface, so the harness needs to be changed slightly:

#!/usr/bin/perl use 5.010_000; # re features use warnings; use strict; our @stack; our $result; my %unary_ops = ( chs => sub { $_[0] = -$_[0] }, abs => sub { $_[0] = abs $_[0] }, sqr => sub { $_[0] *= $_[0] }, sqrt => sub { $_[0] = sqrt $_[0] }, sin => sub { $_[0] = sin $_[0] }, cos => sub { $_[0] = cos $_[0] }, tan => sub { $_[0] = sin $_[0] / cos $_[0] }, ); my %binary_ops = ( '+' => sub { $_[0] += $_[1] }, '-' => sub { $_[0] -= $_[1] }, '/' => sub { $_[0] /= $_[1] }, '*' => sub { $_[0] *= $_[1] }, '^' => sub { $_[0]**= $_[1] }, ); my $RPN = qr{ ... }; my %input = ( ... ); foreach my $line ( keys %input ) { local $result; my ($expression) = $line =~ /($RPN)/; ... }

Results:

OK: .456 = .456 OK: 123. = 123. OK: 4 2 + = 6 OK: 123 / is invalid, got no result OK: 24 abs = 24 OK: 1 24 abs + sqrt = 5 OK: 12 34 is invalid, got no result OK: 123 34 * chs = -4182 OK: 4 2 - = 2 OK: 34 2 / 2 ^ = 289 OK: 24 4 + chs sqr = 784 OK: 123.456 + is invalid, got no result OK: 123 chs 34 * = -4182 OK: fail is invalid, got no result OK: 24 chs abs = 24 OK: 25 abs sqrt = 5 OK: 12 34+ is invalid, got no result OK: 12 34 + + is invalid, got no result OK: 123 chs chs chs = -123 OK: 1 24abs+sqrt is invalid, got no result OK: 12 34 + 56 + = 102 OK: 123 chs = -123 OK: 24 = 24 OK: 1 fail is invalid, got no result OK: 12 34 56 + + = 102 OK: 4 2 ^ = 16 OK: 123 chs chs chs chs = 123 OK: 3 34 2 / 2 ^ * = 867 OK: 25 chs abs sqrt = 5 OK: 123.456 = 123.456 OK: 123 = 123 OK: 123 chs chs = 123 OK: 24 chs = -24 OK: 123 = 123 OK: 24abs is invalid, got no result OK: 4 2 / = 2 OK: 24abssqrt is invalid, got no result OK: 4 2 * = 8