comment on

An adventure with 5.10 regular expressions

Shortly after perl 5.9.5 was released I was impressed by this presentation by Yves Orton, particularly grammars. When 5.10 was released I thought I'd play around a bit and learn how to use the new features, especially recursion and grammars. After a couple of false starts, mainly caused by not understanding the documentation, I decided on the goals.

It had to be simple
I could express the grammar as a Backus Normal Form
I wanted to use recursion
I would have fun

The obvious choice was a regular expression to match and evaluate Reverse Polish Notation (RPN). I still pine for my HP41C;-)

An easy start

The BNF is fairly straight forward.

expr      := number
           | expr ws unary_operator 
           | expr ws expr ws binary_operator
number    := digit {digit} [ '.' {digit} ]
           | '.' digit {digit}
digit     := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
unary_op  := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan"
binary_op := '+' | '-' | '*' | '/' | '^'
ws        := ' ' | '\t' | '\r' | '\n'

Note: I use "chs" as shorthand for 'c' 'h' 's'
[download]

Translation from a BNF production to a perl regular expression is not difficult. Production rules have names and refer to each other by name, perl regexes now have named captures and they can fill the same role. For example

digit  -> /(?<digit> \d )/msx
number -> /(?<number> (?&digit)+ (?:[.](?&digit)*)
                |     [.](?&digit)+ )/msx
[download]

in practice digit and ws are so simple I just inlined them.

This is the complete regular expression generated from that BNF to match an arbitrary RPN expression.

qr{
(?(DEFINE)
    (?<expr>
        (?&number)
    |   (?&expr) \s+ (&unary_op)
    |   (?&expr) \s+ (?&expr) \s+ (&binary_op)
    )
    (?<number>
        \d+ (?: [.] \d* )?
    |   [.] \d+
    )
    (?<unary_op>
        (?: chs | abs | sqr | sqrt | sin | cos | tan )
    )
    (?<binary_op>
        [-+/*^]
    )
)
    \A \s* (?&expr) \s* \z
}msx;
[download]

Feeling fairly pleased with myself I started to test it and second try, matching against '12 34 +', I got Infinite recursion in regex at rpg.pl line 136. I tinkered a bit sticking print statements into the regex. The regex gets stuck at position 0 and I couldn't figure out how to break the recursion. The problem is that I need some way to force the expr clause to never backtrack if the number alternative matches. For example when matching '12 chs' the number alternative of expr matches 12 but the whole fails because the end of the string hasn't been reached, it then tries expr ws unary_op, recurses back to expr and number grabs 12 but then fails because it hasn't got to the end of the string, so it backtracks out and it now attempts expr again, and again, and...

I probably should have paid more attention when the lecturer talked about top down parsers and LLR grammars. If any monk can shed light I will be most happy. As an aside, I discovered that say isn't recognized in a regex code block as I got this error

String found where operator expected at (re_eval 10) line 1, near "say
+ "matching at ""
    (Do you need to predeclare say?)
syntax error at (re_eval 10) line 1, near "say "matching at ""
Compilation failed in regexp at rpg.pl line 137.
[download]

It also gives an error for given/when, I didn't test state as that is too bizarre.

Backtracking regexes

It looked like I needed to rethink my approach. If I were writing an RPN evaluator from scratch, I'd

parse the string from the start
pick out the tokens
check the first token is a number
push numbers onto a stack
apply unary operators to the top of the stack
pop the top element and apply a binary operator to it and the top of the stack
check the stack had exactly one number at the end of the input

This still looks like a good match for regular expressions and I can reuse some of the definitions from the failed attempt.

My next regex was this

(?(DEFINE)
    (?<expr>
        (?&number)
        (?:
            \s*
            (?:
                (?&number)
            |   (?&unary_op)
            |   (?&binary_op)
            )
        )*
    )
    (?<number>
        (?> \d+ (?: [.] \d* )? | [.] \d+ )
        (?{ $^R + 1})
    )
    (?<unary_op>
        (?i: chs | abs | sqr | sqrt | sin | cos | tan ) 
        (?![a-zA-Z])
    )
    (?<binary_op>
        [-+/*^]
        (?{ $^R - 1 })
    )
)

    \A \s* (?&expr) \s* \z
    (?(?{ $^R != 1 }) (*FAIL) ) 
}msx
[download]

The special variable $^R contains the value of the last (?{ code }) block executed. Is localized so that it's value is restored on backtracking. I use it to keep track of the number of numbers seen. This did indeed pass my tests.

A working calculator

Having got this far it would seem a simple matter to extract the tokens but this paragraph from perlre points out the difficulties ahead.

An example of how this might be used is as follows:
  /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
      (?(DEFINE)
          (?<NAME_PAT>....)
          (?<ADRESS_PAT>....)
      )/x
[download]
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.

The entire RPN expression can be matched with /($RPN)/ but you can't access any of the internal matches made during a recursive regex. I wanted to somehow capture the internal matches and use them to evaluate the expression. There is a bit of work to do. Some points to remember

Lexical variables don't reliably work with recursive regexes, need to use globals
Using local when modifying a variable restores the value when backtracking occurs
Global variables need to be in a known good state before starting the match, local helps here as the original value is restored on regex exit. You could reset them before using the regex but it seems better practice to let the regex do it.

To keep track of numbers I used @stack and modified the number rule by adding capturing parentheses, $^N is the value of the last capture and the code section pushes it onto the stack.

    (?<number>
        ( (?> \d+ (?: [.] \d* )? | [.] \d+ ) )
        (?{ local @stack = ( @stack, $^N ) })
    )
[download]

I made similar transformations of unary_op and binary_op. This is the complete regex.

qr{
(?(DEFINE)
    (?<expr>
        (?&number)
        (?:
            \s+
            (?:
                (?&number)
            |   (?&unary_op)
            |   (?&binary_op)
            )
        )*
    )
    (?<number>
        ( \d+ (?: [.] \d* )? | [.] \d+ )
        (?{ local @stack = ( @stack, $^N ) })
    )
    (?<unary_op>
        ( chs | abs | sqr | sqrt | sin | cos | tan )
        (?(?{ @stack < 1 }) (*FAIL) )
        (?{ local @stack = @stack; $operators{$^N}->($stack[-1]) })
    )
    (?<binary_op>
        ( [-+/*^] )
        (?(?{ @stack < 2 }) (*FAIL) )
        (?{ local @stack = @stack; $operators{$^N}->($stack[-2], pop @
+stack) })
    )
)

    \A \s* (?&expr) \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
    |
    (?{ $result = undef }) (*FAIL)

}msx
[download]

Note the use of (*FAIL) clauses to check that the stack has the expected number of elements and, combined with alternation, to reset $result to undef if no match occurred.

For the sake of brevity (although I feel I can't be accused of that) I have omitted error checking. A divide by zero or square root of a negative number will cause the regex to die.

On reflection

It occurred to me later that I could eliminate recursion entirely, without recursion I didn't need the extra level of capturing parentheses as I could use $+{foo} instead. Also I could put the number | unary_op | binary_op clause in (?> ). Without backtracking I no longer need to localize @stack. I did have to ensure that @stack was initialized to () at the start of the regex. $result = undef was also moved to the start. I think it better to initialize variables before use rather than relying on them being put back after the last time (I have small children, I know it doesn't happen;)

qr{
    (?{ @stack  = ();
        $result = undef;
    })
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ push @stack, $+{start_number} })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ push @stack, $+{number} })
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
            (?(?{ @stack < 1 }) (*FAIL) )
            (?{ $operators{ $+{unary_op} }->($stack[-1]) })
        |
            (?<binary_op> [-+/*^] )
            (?(?{ @stack < 2 }) (*FAIL) )
            (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) 
+})
        )
    )*
    \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
}msx,
[download]

A small gotcha was that any expression with sqrt in it failed. The regex matched sqr and could not backtrack to try sqrt, the solution I used was to reverse the order of the two words, I could also have combined them as sqrt?.

Emma Chissit

A popular clash of Australian and English English occurred in 1964 when Monica Dickens, an English writer, was signing her latest book in Sydney. As told by the Sydney Morning Herald, 30 November 1964, the conversation between one of the female Australians in the queue and the author was as follows:
Aussie: Emma Chissit
Author, thinking that was her name, wrote "To Emma Chissit" in the book cover.
Aussie (speaking deliberately): No, emma chissit?

The australian was, of course, speaking strine and 'How much is it?' was the question.

To see what is the overhead of recursion I used Benchmark.

my $input = join ' ', (1) x 1_001, ('+') x 1_000;

cmpthese(
    100,
    {
        recursive => sub { $input1 =~ $RPN1 },
        iterative => sub { $input2 =~ $RPN2 },
    },
);
            s/iter recursive      flat
recursive     1.35        --      -99%
iterative 1.47e-02     9103%        --
[download]

Recursion was more expensive than I had expected, Admittedly this is a pathalogical case where the stack becomes huge. Where the stack has at most 2 elements the difference is considerably less.

# my $input = join ' ', 1, ('1 +') x 1_000;
            Rate recursive iterative
recursive 40.5/s        --      -40%
iterative 67.6/s       67%        --
[download]

Even so a relatively small input and stack suffers.

# my $input = join ' ', (1) x 4, ('+') x 3;
             Rate recursive iterative
recursive 14306/s        --      -17%
iterative 17271/s       21%        --
[download]

Just to see how much the evaluation of the RPN expression affected the result I ran the a benchmark against the same data but with non evaluating regexes that use $^R to keep track of the number of numbers seen. That is the first working regex and this one adapted from the iterative version.

qr{
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ 1 })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ $^R + 1})
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
        |
            (?<binary_op> [-+/*^] )
            (?{ $^R - 1 })
        )
    )*
    \s* \z
    (?(?{ $^R != 1 }) (*FAIL) )
}msx,
[download]

# my $input = join ' ', (1) x 1_001, ('+') x 1_000;
            Rate recursive iterative
recursive 83.6/s        --      -80%
iterative  415/s      396%        --

# my $input = join ' ', 1, ('1 +') x 1_000;
            Rate recursive iterative
recursive 82.2/s        --      -78%
iterative  376/s      357%        --

# my $input = join ' ', (1) x 4, ('+') x 3;
             Rate recursive iterative
recursive 28249/s        --      -69%
iterative 92593/s      228%        --
[download]

As you can see the difference is quite marked.

Here is a script that tests the regex.

#!//net/perl/5.10.0/bin/perl
use warnings;
use strict;

use 5.010_000;

our @stack;
our $result;

my %operators = (
    '+' => sub { $_[0] += $_[1] },
    '-' => sub { $_[0] -= $_[1] },
    '/' => sub { $_[0] /= $_[1] },
    '*' => sub { $_[0] *= $_[1] },
    '^'  => sub { $_[0]**= $_[1] },
    chs  => sub { $_[0] = -$_[0] },
    abs  => sub { $_[0] = abs $_[0] },
    sqr  => sub { $_[0] *= $_[0] },
    sqrt => sub { $_[0] = sqrt $_[0] },
    sin  => sub { $_[0] = sin $_[0] },
    cos  => sub { $_[0] = cos $_[0] },
    tan  => sub { $_[0] = sin $_[0] / cos $_[0] },
);

my $RPN = qr{
    (?{ @stack = () })
    (?{ $result = undef })
    \A \s*
    (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ )
    (?{ push @stack, $+{start_number} })
    (?>
        \s+
        (?:
            (?<number> \d+ (?: [.] \d* )? | [.] \d+ )
            (?{ push @stack, $+{number} })
        |
            (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan )
            (?(?{ @stack < 1 }) (*FAIL) )
            (?{ $operators{ $+{unary_op} }->($stack[-1]) })
        |
            (?<binary_op> [-+/*^] )
            (?(?{ @stack < 2 }) (*FAIL) )
            (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) 
+})
        )
    )*
    \s* \z
    (?(?{ @stack != 1 }) (*FAIL) )
    (?{ $result = $stack[0] })
}msx,

my %input = (
    'fail'                => undef,
    '1 fail'              => undef,
    '123'                 => 123,
    '123.456'             => 123.456,
    '123.456 +'           => undef,
    '.456'                => .456,
    '123.'                => 123.,
    '123 /'               => undef,
    ' 123 '               => 123,
    '123 chs'             => -123,
    '123 chs chs'         => 123,
    '123 chs chs chs'     => -123,
    '123 chs chs chs chs' => 123,
    '4 2 ^'               => 16,
    '4 2 /'               => 2,
    '4 2 +'               => 6,
    '4 2 *'               => 8,
    '4 2 -'               => 2,
    '12 34+'              => undef,
    '12 34'               => undef,
    '12 34 + +'           => undef,
    '12 34 56 + +'        => 102,
    '12 34 + 56 +'        => 102,
    '123 chs 34 *'        => -4_182,
    '123 34 * chs'        => -4_182,
    '24'                  => 24,
    '24 abs'              => 24,
    '24abs'               => undef,
    '24abssqrt'           => undef,
    '1 24abs+sqrt'        => undef,
    '1 24 abs + sqrt'     => 5,
    '24 chs abs'          => 24,
    '24 chs'              => -24,
    '25 abs sqrt'         => 5,
    '25 chs abs sqrt'     => 5,
    '24 4 + chs sqr'      => 784,
    '34 2 / 2 ^'          => 289,
    '3 34 2 / 2 ^ *'      => 867,
);

foreach my $line ( keys %input ) {
    my ($expression) = $line =~ /($RPN)/;
    if ( !defined $result && !defined $input{$line} ) {
        say "OK: $line is invalid, got no result";
    }
    elsif ( !defined $result ) {
        say "FAILED: $line expected $input{$line}, got no result";
    }
    elsif ( !defined $input{$line} ) {
        say "FAILED: $line got $result, expected no result";
    }
    elsif ( $result == $input{$line} ) {
        say "OK: $expression = $result";
    }
    else {
        say "FAILED: $line got $result, expected $input{$line}";
    }
}
[download]

Final thoughts

regextidy would make my day.
A correct BNF is useful even if you can't turn it into a working regular expression.
The new regular features are powerful.
I don't yet completely understand how to use them.
Capturing matches during recursion requires hand crafting.
Practice makes, if not perfect, then better.
Recursion is expensive.
I had fun;-)
regextidy would really make my day.

Postscript

I started by looking for a good introdutory tutorial for the new features and, when I couldn't find anything, then to write a simple tutorial showing most of them. It didn't turn out that way but at the least I warn others of the pitfalls and, I hope, show a field worth exploring. With help I think I can write such an article so all criticism, tips and references will be appreciated. This is my first long posting so any suggestions regarding presentation are especially welcome.

References

In reply to An adventure with 5.10 regular expressions by hipowls

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.