An adventure with 5.10 regular expressions

Shortly after perl 5.9.5 was released I was impressed by this presentation by Yves Orton, particularly grammars. When 5.10 was released I thought I'd play around a bit and learn how to use the new features, especially recursion and grammars. After a couple of false starts, mainly caused by not understanding the documentation, I decided on the goals.

  1. It had to be simple
  2. I could express the grammar as a Backus Normal Form
  3. I wanted to use recursion
  4. I would have fun
The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN). I still pine for my HP41C;-)

An easy start

The BNF is fairly straight forward.

expr := number | expr ws unary_operator | expr ws expr ws binary_operator number := digit {digit} [ '.' {digit} ] | '.' digit {digit} digit := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' unary_op := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan" binary_op := '+' | '-' | '*' | '/' | '^' ws := ' ' | '\t' | '\r' | '\n' Note: I use "chs" as shorthand for 'c' 'h' 's'

Translation from a BNF production to a perl regular expression is not difficult. Production rules have names and refer to each other by name, perl regexes now have named captures and they can fill the same role. For example

digit -> /(?<digit> \d )/msx number -> /(?<number> (?&digit)+ (?:[.](?&digit)*) | [.](?&digit)+ )/msx
in practice digit and ws are so simple I just inlined them.

This is the complete regular expression generated from that BNF to match an arbitrary RPN expression.

qr{ (?(DEFINE) (?<expr> (?&number) | (?&expr) \s+ (&unary_op) | (?&expr) \s+ (?&expr) \s+ (&binary_op) ) (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?<unary_op> (?: chs | abs | sqr | sqrt | sin | cos | tan ) ) (?<binary_op> [-+/*^] ) ) \A \s* (?&expr) \s* \z }msx;
Feeling fairly pleased with myself I started to test it and second try, matching against '12 34 +', I got Infinite recursion in regex at rpg.pl line 136. I tinkered a bit sticking print statements into the regex. The regex gets stuck at position 0 and I couldn't figure out how to break the recursion. The problem is that I need some way to force the expr clause to never backtrack if the number alternative matches. For example when matching '12 chs' the number alternative of expr matches 12 but the whole fails because the end of the string hasn't been reached, it then tries expr ws unary_op, recurses back to expr and number grabs 12 but then fails because it hasn't got to the end of the string, so it backtracks out and it now attempts expr again, and again, and...

I probably should have paid more attention when the lecturer talked about top down parsers and LLR grammars. If any monk can shed light I will be most happy. As an aside, I discovered that say isn't recognized in a regex code block as I got this error

String found where operator expected at (re_eval 10) line 1, near "say + "matching at "" (Do you need to predeclare say?) syntax error at (re_eval 10) line 1, near "say "matching at "" Compilation failed in regexp at rpg.pl line 137.
It also gives an error for given/when, I didn't test state as that is too bizarre.

Backtracking regexes

It looked like I needed to rethink my approach. If I were writing an RPN evaluator from scratch, I'd

This still looks like a good match for regular expressions and I can reuse some of the definitions from the failed attempt.

My next regex was this

(?(DEFINE) (?<expr> (?&number) (?: \s* (?: (?&number) | (?&unary_op) | (?&binary_op) ) )* ) (?<number> (?> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ $^R + 1}) ) (?<unary_op> (?i: chs | abs | sqr | sqrt | sin | cos | tan ) (?![a-zA-Z]) ) (?<binary_op> [-+/*^] (?{ $^R - 1 }) ) ) \A \s* (?&expr) \s* \z (?(?{ $^R != 1 }) (*FAIL) ) }msx
The special variable $^R contains the value of the last (?{ code }) block executed. Is localized so that it's value is restored on backtracking. I use it to keep track of the number of numbers seen. This did indeed pass my tests.

A working calculator

Having got this far it would seem a simple matter to extract the tokens but this paragraph from perlre points out the difficulties ahead.

An example of how this might be used is as follows:
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT)) (?(DEFINE) (?<NAME_PAT>....) (?<ADRESS_PAT>....) )/x
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.

The entire RPN expression can be matched with /($RPN)/ but you can't access any of the internal matches made during a recursive regex. I wanted to somehow capture the internal matches and use them to evaluate the expression. There is a bit of work to do. Some points to remember

To keep track of numbers I used @stack and modified the number rule by adding capturing parentheses, $^N is the value of the last capture and the code section pushes it onto the stack.
(?<number> ( (?> \d+ (?: [.] \d* )? | [.] \d+ ) ) (?{ local @stack = ( @stack, $^N ) }) )
I made similar transformations of unary_op and binary_op. This is the complete regex.
qr{ (?(DEFINE) (?<expr> (?&number) (?: \s+ (?: (?&number) | (?&unary_op) | (?&binary_op) ) )* ) (?<number> ( \d+ (?: [.] \d* )? | [.] \d+ ) (?{ local @stack = ( @stack, $^N ) }) ) (?<unary_op> ( chs | abs | sqr | sqrt | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ local @stack = @stack; $operators{$^N}->($stack[-1]) }) ) (?<binary_op> ( [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ local @stack = @stack; $operators{$^N}->($stack[-2], pop @ +stack) }) ) ) \A \s* (?&expr) \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) | (?{ $result = undef }) (*FAIL) }msx
Note the use of (*FAIL) clauses to check that the stack has the expected number of elements and, combined with alternation, to reset $result to undef if no match occurred.

For the sake of brevity (although I feel I can't be accused of that) I have omitted error checking. A divide by zero or square root of a negative number will cause the regex to die.

On reflection

It occurred to me later that I could eliminate recursion entirely, without recursion I didn't need the extra level of capturing parentheses as I could use $+{foo} instead. Also I could put the number | unary_op | binary_op clause in (?> ). Without backtracking I no longer need to localize @stack. I did have to ensure that @stack was initialized to () at the start of the regex. $result = undef was also moved to the start. I think it better to initialize variables before use rather than relying on them being put back after the last time (I have small children, I know it doesn't happen;)

qr{ (?{ @stack = (); $result = undef; }) \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{start_number} }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{number} }) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ $operators{ $+{unary_op} }->($stack[-1]) }) | (?<binary_op> [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) +}) ) )* \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) }msx,
A small gotcha was that any expression with sqrt in it failed. The regex matched sqr and could not backtrack to try sqrt, the solution I used was to reverse the order of the two words, I could also have combined them as sqrt?.

Emma Chissit

A popular clash of Australian and English English occurred in 1964 when Monica Dickens, an English writer, was signing her latest book in Sydney. As told by the Sydney Morning Herald, 30 November 1964, the conversation between one of the female Australians in the queue and the author was as follows:
Aussie: Emma Chissit
Author, thinking that was her name, wrote "To Emma Chissit" in the book cover.
Aussie (speaking deliberately): No, emma chissit?
The australian was, of course, speaking strine and 'How much is it?' was the question.

To see what is the overhead of recursion I used Benchmark.

my $input = join ' ', (1) x 1_001, ('+') x 1_000; cmpthese( 100, { recursive => sub { $input1 =~ $RPN1 }, iterative => sub { $input2 =~ $RPN2 }, }, ); s/iter recursive flat recursive 1.35 -- -99% iterative 1.47e-02 9103% --
Recursion was more expensive than I had expected, Admittedly this is a pathalogical case where the stack becomes huge. Where the stack has at most 2 elements the difference is considerably less.
# my $input = join ' ', 1, ('1 +') x 1_000; Rate recursive iterative recursive 40.5/s -- -40% iterative 67.6/s 67% --
Even so a relatively small input and stack suffers.
# my $input = join ' ', (1) x 4, ('+') x 3; Rate recursive iterative recursive 14306/s -- -17% iterative 17271/s 21% --
Just to see how much the evaluation of the RPN expression affected the result I ran the a benchmark against the same data but with non evaluating regexes that use $^R to keep track of the number of numbers seen. That is the first working regex and this one adapted from the iterative version.
qr{ \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ 1 }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ $^R + 1}) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) | (?<binary_op> [-+/*^] ) (?{ $^R - 1 }) ) )* \s* \z (?(?{ $^R != 1 }) (*FAIL) ) }msx,
# my $input = join ' ', (1) x 1_001, ('+') x 1_000; Rate recursive iterative recursive 83.6/s -- -80% iterative 415/s 396% -- # my $input = join ' ', 1, ('1 +') x 1_000; Rate recursive iterative recursive 82.2/s -- -78% iterative 376/s 357% -- # my $input = join ' ', (1) x 4, ('+') x 3; Rate recursive iterative recursive 28249/s -- -69% iterative 92593/s 228% --
As you can see the difference is quite marked.

Here is a script that tests the regex.

#!//net/perl/5.10.0/bin/perl use warnings; use strict; use 5.010_000; our @stack; our $result; my %operators = ( '+' => sub { $_[0] += $_[1] }, '-' => sub { $_[0] -= $_[1] }, '/' => sub { $_[0] /= $_[1] }, '*' => sub { $_[0] *= $_[1] }, '^' => sub { $_[0]**= $_[1] }, chs => sub { $_[0] = -$_[0] }, abs => sub { $_[0] = abs $_[0] }, sqr => sub { $_[0] *= $_[0] }, sqrt => sub { $_[0] = sqrt $_[0] }, sin => sub { $_[0] = sin $_[0] }, cos => sub { $_[0] = cos $_[0] }, tan => sub { $_[0] = sin $_[0] / cos $_[0] }, ); my $RPN = qr{ (?{ @stack = () }) (?{ $result = undef }) \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{start_number} }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{number} }) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ $operators{ $+{unary_op} }->($stack[-1]) }) | (?<binary_op> [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) +}) ) )* \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) }msx, my %input = ( 'fail' => undef, '1 fail' => undef, '123' => 123, '123.456' => 123.456, '123.456 +' => undef, '.456' => .456, '123.' => 123., '123 /' => undef, ' 123 ' => 123, '123 chs' => -123, '123 chs chs' => 123, '123 chs chs chs' => -123, '123 chs chs chs chs' => 123, '4 2 ^' => 16, '4 2 /' => 2, '4 2 +' => 6, '4 2 *' => 8, '4 2 -' => 2, '12 34+' => undef, '12 34' => undef, '12 34 + +' => undef, '12 34 56 + +' => 102, '12 34 + 56 +' => 102, '123 chs 34 *' => -4_182, '123 34 * chs' => -4_182, '24' => 24, '24 abs' => 24, '24abs' => undef, '24abssqrt' => undef, '1 24abs+sqrt' => undef, '1 24 abs + sqrt' => 5, '24 chs abs' => 24, '24 chs' => -24, '25 abs sqrt' => 5, '25 chs abs sqrt' => 5, '24 4 + chs sqr' => 784, '34 2 / 2 ^' => 289, '3 34 2 / 2 ^ *' => 867, ); foreach my $line ( keys %input ) { my ($expression) = $line =~ /($RPN)/; if ( !defined $result && !defined $input{$line} ) { say "OK: $line is invalid, got no result"; } elsif ( !defined $result ) { say "FAILED: $line expected $input{$line}, got no result"; } elsif ( !defined $input{$line} ) { say "FAILED: $line got $result, expected no result"; } elsif ( $result == $input{$line} ) { say "OK: $expression = $result"; } else { say "FAILED: $line got $result, expected $input{$line}"; } }

Final thoughts

Postscript

I started by looking for a good introdutory tutorial for the new features and, when I couldn't find anything, then to write a simple tutorial showing most of them. It didn't turn out that way but at the least I warn others of the pitfalls and, I hope, show a field worth exploring. With help I think I can write such an article so all criticism, tips and references will be appreciated. This is my first long posting so any suggestions regarding presentation are especially welcome.

References


In reply to An adventure with 5.10 regular expressions by hipowls

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.