Shortly after perl 5.9.5 was released I was impressed by this presentation by Yves Orton, particularly grammars. When 5.10 was released I thought I'd play around a bit and learn how to use the new features, especially recursion and grammars. After a couple of false starts, mainly caused by not understanding the documentation, I decided on the goals.
The BNF is fairly straight forward.
expr := number | expr ws unary_operator | expr ws expr ws binary_operator number := digit {digit} [ '.' {digit} ] | '.' digit {digit} digit := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' unary_op := "chs" | "abs" | "sqr" | "sqrt" | "sin" | "cos" | "tan" binary_op := '+' | '-' | '*' | '/' | '^' ws := ' ' | '\t' | '\r' | '\n' Note: I use "chs" as shorthand for 'c' 'h' 's'
Translation from a BNF production to a perl regular expression is not difficult. Production rules have names and refer to each other by name, perl regexes now have named captures and they can fill the same role. For example
in practice digit and ws are so simple I just inlined them.digit -> /(?<digit> \d )/msx number -> /(?<number> (?&digit)+ (?:[.](?&digit)*) | [.](?&digit)+ )/msx
This is the complete regular expression generated from that BNF to match an arbitrary RPN expression.
Feeling fairly pleased with myself I started to test it and second try, matching against '12 34 +', I got Infinite recursion in regex at rpg.pl line 136. I tinkered a bit sticking print statements into the regex. The regex gets stuck at position 0 and I couldn't figure out how to break the recursion. The problem is that I need some way to force the expr clause to never backtrack if the number alternative matches. For example when matching '12 chs' the number alternative of expr matches 12 but the whole fails because the end of the string hasn't been reached, it then tries expr ws unary_op, recurses back to expr and number grabs 12 but then fails because it hasn't got to the end of the string, so it backtracks out and it now attempts expr again, and again, and...qr{ (?(DEFINE) (?<expr> (?&number) | (?&expr) \s+ (&unary_op) | (?&expr) \s+ (?&expr) \s+ (&binary_op) ) (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?<unary_op> (?: chs | abs | sqr | sqrt | sin | cos | tan ) ) (?<binary_op> [-+/*^] ) ) \A \s* (?&expr) \s* \z }msx;
I probably should have paid more attention when the lecturer talked about top down parsers and LLR grammars. If any monk can shed light I will be most happy. As an aside, I discovered that say isn't recognized in a regex code block as I got this error
It also gives an error for given/when, I didn't test state as that is too bizarre.
It looked like I needed to rethink my approach. If I were writing an RPN evaluator from scratch, I'd
My next regex was this
The special variable $^R contains the value of the last (?{ code }) block executed. Is localized so that it's value is restored on backtracking. I use it to keep track of the number of numbers seen. This did indeed pass my tests.(?(DEFINE) (?<expr> (?&number) (?: \s* (?: (?&number) | (?&unary_op) | (?&binary_op) ) )* ) (?<number> (?> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ $^R + 1}) ) (?<unary_op> (?i: chs | abs | sqr | sqrt | sin | cos | tan ) (?![a-zA-Z]) ) (?<binary_op> [-+/*^] (?{ $^R - 1 }) ) ) \A \s* (?&expr) \s* \z (?(?{ $^R != 1 }) (*FAIL) ) }msx
Having got this far it would seem a simple matter to extract the tokens but this paragraph from perlre points out the difficulties ahead.
An example of how this might be used is as follows:Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be./(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT)) (?(DEFINE) (?<NAME_PAT>....) (?<ADRESS_PAT>....) )/x
The entire RPN expression can be matched with /($RPN)/ but you can't access any of the internal matches made during a recursive regex. I wanted to somehow capture the internal matches and use them to evaluate the expression. There is a bit of work to do. Some points to remember
I made similar transformations of unary_op and binary_op. This is the complete regex.(?<number> ( (?> \d+ (?: [.] \d* )? | [.] \d+ ) ) (?{ local @stack = ( @stack, $^N ) }) )
Note the use of (*FAIL) clauses to check that the stack has the expected number of elements and, combined with alternation, to reset $result to undef if no match occurred.qr{ (?(DEFINE) (?<expr> (?&number) (?: \s+ (?: (?&number) | (?&unary_op) | (?&binary_op) ) )* ) (?<number> ( \d+ (?: [.] \d* )? | [.] \d+ ) (?{ local @stack = ( @stack, $^N ) }) ) (?<unary_op> ( chs | abs | sqr | sqrt | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ local @stack = @stack; $operators{$^N}->($stack[-1]) }) ) (?<binary_op> ( [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ local @stack = @stack; $operators{$^N}->($stack[-2], pop @ +stack) }) ) ) \A \s* (?&expr) \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) | (?{ $result = undef }) (*FAIL) }msx
For the sake of brevity (although I feel I can't be accused of that) I have omitted error checking. A divide by zero or square root of a negative number will cause the regex to die.
It occurred to me later that I could eliminate recursion entirely, without recursion I didn't need the extra level of capturing parentheses as I could use $+{foo} instead. Also I could put the number | unary_op | binary_op clause in (?> ). Without backtracking I no longer need to localize @stack. I did have to ensure that @stack was initialized to () at the start of the regex. $result = undef was also moved to the start. I think it better to initialize variables before use rather than relying on them being put back after the last time (I have small children, I know it doesn't happen;)
A small gotcha was that any expression with sqrt in it failed. The regex matched sqr and could not backtrack to try sqrt, the solution I used was to reverse the order of the two words, I could also have combined them as sqrt?.qr{ (?{ @stack = (); $result = undef; }) \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{start_number} }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{number} }) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ $operators{ $+{unary_op} }->($stack[-1]) }) | (?<binary_op> [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) +}) ) )* \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) }msx,
A popular clash of Australian and English English occurred in 1964 when Monica Dickens, an English writer, was signing her latest book in Sydney. As told by the Sydney Morning Herald, 30 November 1964, the conversation between one of the female Australians in the queue and the author was as follows:Aussie: Emma ChissitThe australian was, of course, speaking strine and 'How much is it?' was the question.
Author, thinking that was her name, wrote "To Emma Chissit" in the book cover.
Aussie (speaking deliberately): No, emma chissit?
To see what is the overhead of recursion I used Benchmark.
Recursion was more expensive than I had expected, Admittedly this is a pathalogical case where the stack becomes huge. Where the stack has at most 2 elements the difference is considerably less.my $input = join ' ', (1) x 1_001, ('+') x 1_000; cmpthese( 100, { recursive => sub { $input1 =~ $RPN1 }, iterative => sub { $input2 =~ $RPN2 }, }, ); s/iter recursive flat recursive 1.35 -- -99% iterative 1.47e-02 9103% --
Even so a relatively small input and stack suffers.# my $input = join ' ', 1, ('1 +') x 1_000; Rate recursive iterative recursive 40.5/s -- -40% iterative 67.6/s 67% --
Just to see how much the evaluation of the RPN expression affected the result I ran the a benchmark against the same data but with non evaluating regexes that use $^R to keep track of the number of numbers seen. That is the first working regex and this one adapted from the iterative version.# my $input = join ' ', (1) x 4, ('+') x 3; Rate recursive iterative recursive 14306/s -- -17% iterative 17271/s 21% --
qr{ \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ 1 }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ $^R + 1}) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) | (?<binary_op> [-+/*^] ) (?{ $^R - 1 }) ) )* \s* \z (?(?{ $^R != 1 }) (*FAIL) ) }msx,
As you can see the difference is quite marked.# my $input = join ' ', (1) x 1_001, ('+') x 1_000; Rate recursive iterative recursive 83.6/s -- -80% iterative 415/s 396% -- # my $input = join ' ', 1, ('1 +') x 1_000; Rate recursive iterative recursive 82.2/s -- -78% iterative 376/s 357% -- # my $input = join ' ', (1) x 4, ('+') x 3; Rate recursive iterative recursive 28249/s -- -69% iterative 92593/s 228% --
Here is a script that tests the regex.
#!//net/perl/5.10.0/bin/perl use warnings; use strict; use 5.010_000; our @stack; our $result; my %operators = ( '+' => sub { $_[0] += $_[1] }, '-' => sub { $_[0] -= $_[1] }, '/' => sub { $_[0] /= $_[1] }, '*' => sub { $_[0] *= $_[1] }, '^' => sub { $_[0]**= $_[1] }, chs => sub { $_[0] = -$_[0] }, abs => sub { $_[0] = abs $_[0] }, sqr => sub { $_[0] *= $_[0] }, sqrt => sub { $_[0] = sqrt $_[0] }, sin => sub { $_[0] = sin $_[0] }, cos => sub { $_[0] = cos $_[0] }, tan => sub { $_[0] = sin $_[0] / cos $_[0] }, ); my $RPN = qr{ (?{ @stack = () }) (?{ $result = undef }) \A \s* (?<start_number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{start_number} }) (?> \s+ (?: (?<number> \d+ (?: [.] \d* )? | [.] \d+ ) (?{ push @stack, $+{number} }) | (?<unary_op> chs | abs | sqrt | sqr | sin | cos | tan ) (?(?{ @stack < 1 }) (*FAIL) ) (?{ $operators{ $+{unary_op} }->($stack[-1]) }) | (?<binary_op> [-+/*^] ) (?(?{ @stack < 2 }) (*FAIL) ) (?{ $operators{ $+{binary_op} }->($stack[-2], pop @stack) +}) ) )* \s* \z (?(?{ @stack != 1 }) (*FAIL) ) (?{ $result = $stack[0] }) }msx, my %input = ( 'fail' => undef, '1 fail' => undef, '123' => 123, '123.456' => 123.456, '123.456 +' => undef, '.456' => .456, '123.' => 123., '123 /' => undef, ' 123 ' => 123, '123 chs' => -123, '123 chs chs' => 123, '123 chs chs chs' => -123, '123 chs chs chs chs' => 123, '4 2 ^' => 16, '4 2 /' => 2, '4 2 +' => 6, '4 2 *' => 8, '4 2 -' => 2, '12 34+' => undef, '12 34' => undef, '12 34 + +' => undef, '12 34 56 + +' => 102, '12 34 + 56 +' => 102, '123 chs 34 *' => -4_182, '123 34 * chs' => -4_182, '24' => 24, '24 abs' => 24, '24abs' => undef, '24abssqrt' => undef, '1 24abs+sqrt' => undef, '1 24 abs + sqrt' => 5, '24 chs abs' => 24, '24 chs' => -24, '25 abs sqrt' => 5, '25 chs abs sqrt' => 5, '24 4 + chs sqr' => 784, '34 2 / 2 ^' => 289, '3 34 2 / 2 ^ *' => 867, ); foreach my $line ( keys %input ) { my ($expression) = $line =~ /($RPN)/; if ( !defined $result && !defined $input{$line} ) { say "OK: $line is invalid, got no result"; } elsif ( !defined $result ) { say "FAILED: $line expected $input{$line}, got no result"; } elsif ( !defined $input{$line} ) { say "FAILED: $line got $result, expected no result"; } elsif ( $result == $input{$line} ) { say "OK: $expression = $result"; } else { say "FAILED: $line got $result, expected $input{$line}"; } }
I started by looking for a good introdutory tutorial for the new features and, when I couldn't find anything, then to write a simple tutorial showing most of them. It didn't turn out that way but at the least I warn others of the pitfalls and, I hope, show a field worth exploring. With help I think I can write such an article so all criticism, tips and references will be appreciated. This is my first long posting so any suggestions regarding presentation are especially welcome.
|
|---|