http://qs1969.pair.com?node_id=1189633


in reply to Regular Expression Test

#!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; $answer = /\G(\d+)/gc ? crossmap $answer, [ $1 ] : /\G\|/gc ? [ @$answer, @{ expr() } ] : /\G\[(\d+)\]/gc ? crossmap $answer, [ split //, $1 ] : /\G\(/gc ? ( crossmap($answer, expr()), /\G\)/gc || err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])

Hey, it works for all your test cases :)

UPDATE: typo fixed

Replies are listed 'Best First'.
Re^2: Regular Expression Test
by Discipulus (Canon) on May 06, 2017 at 12:44 UTC
    tybalt89++

    I'd like to understand it.. ;=)

    Is this a lex parser as noted in perlre ?

    It just support the test cases? or can be extended to support more regex operators?

    I was trying a raw approach but when i saw your i gave up.

    Brilliant regex hacker!

    UPDATE: my attempt with the simplicistic data sample proposed in OP (show representative data as corollary to Know your data principle), was something like:

    use strict; use warnings; while(<DATA>){ chomp; my $rx; eval{ $rx = qr($_) }; if($@){print "-->$_<-- INVALID: ",lc $@=~s/$0 line \d+,//r; next +} print "$rx\t=>\t"; s/[\(\)]//g; foreach my $part (split /\|/){ if ($part=~/(\d+)?\[\d+\]/){ print join ' ', grep{/$rx/}$1.0 .. $1.9; print ' '; } else{print "$part "} } print "\n"; } __DATA__ (65|70) (3999[678]|4[1678]) 5[45] (6[4569]|7[01]) (((broken (65|70) # out (?^:(65|70)) => 65 70 (?^:(3999[678]|4[1678])) => 39996 39997 39998 41 46 47 48 (?^:5[45]) => 54 55 (?^:(6[4569]|7[01])) => 64 65 66 69 70 71 -->(((broken<-- INVALID: unmatched ( in regex; marked by <-- here in m +/((( <-- here broken/ at <data> line 5. (?^:(65|70)) => 65 70

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      This is not just a "lex"-like scanner as noted in perlre, but a fully recursive parser (note the calls to expr() inside some of the actions). It will properly handle multiple levels of parentheses.

      The advice I would give is to instrument it up with debug prints to see what's happening, also adding a recursion level counter to track recursive calls.

      It is loosely based on this expression parser I wrote a while ago:

      #!/usr/bin/perl use strict; # mini.pl - modified Pratt parser by tybalt89 use warnings; # https://en.wikipedia.org/wiki/Pratt_parser sub error { die "ERROR ", s/\G/ <@_> /r, "\n" } sub expr # two statement parser - precedences: (3 **) (2 * /) (1 + -) { my $answer = /\G\s* ((?:\d+(?:\.\d*)?|\.\d+)(e[+-]?\d+)?) /gcxi ? $1 : /\G\s*\(/gc ? (expr(0), /\G\s*\)/gc || error 'missing )')[0] : /\G\s* - /gcx ? -expr(3) : # unary minus /\G\s* \+ /gcx ? +expr(3) : # unary plus error 'bad operand'; $answer = $_[0] <= 3 && /\G\s* \*\* /gcx ? $answer ** expr(3) : $_[0] <= 2 && /\G\s* \* /gcx ? $answer * expr(3) : $_[0] <= 2 && /\G\s* \/ /gcx ? $answer / expr(3) : $_[0] <= 1 && /\G\s* \+ /gcx ? $answer + expr(2) : $_[0] <= 1 && /\G\s* \- /gcx ? $answer - expr(2) : return $answer while 1; } for ( @ARGV ? @ARGV : scalar <> ) # source as commandline args or stdi +n { my $answer = expr(0); /\G\s*\z/gc ? print s/\s*\z/ = $answer\n/r : error 'incomplete parse +'; }

      which was also the basis for this Re: Parsing Boolean expressions

      but it doesn't need the precedence stuff.

        Tybalt, this arithmatic evaluator of your is much simpler and more useful than what I've managed to find on CPAN.

        I suggest you create a simpler interface for it - such as my $ans = expr($expression); - then make it in to a CPAN module.

      Here's the same code with one debug print added to show what's going on at the beginning of each time around the "while 1" loop.

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; print("\ndebug ", s/\G/<<you are here with [@$answer]>>/r, " end " +), $answer = /\G\|/gc ? [ @$answer, @{ expr() } ] : /\G\d+/gc ? crossmap $answer, [ $& ] : /\G\[\d+\]/gc ? crossmap $answer, [ $& =~ /\d/g ] : /\G\(/gc ? ( crossmap($answer, expr()), /\G\)/gc || err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])

      It prints the following ( showing both where the current position in the input is (the \G) and the contents of the variable $answer at the time).:

      (65|70) => debug <<you are here with []>>(65|70) end debug (<<you are here with []>>65|70) end debug (65<<you are here with [65]>>|70) end debug (65|<<you are here with []>>70) end debug (65|70<<you are here with [70]>>) end debug (65|70<<you are here with [65 70]>>) end debug (65|70)<<you are here with [65 70]>> end 65,70 (3[678]|4[1678]) => debug <<you are here with []>>(3[678]|4[1678]) end debug (<<you are here with []>>3[678]|4[1678]) end debug (3<<you are here with [3]>>[678]|4[1678]) end debug (3[678]<<you are here with [36 37 38]>>|4[1678]) end debug (3[678]|<<you are here with []>>4[1678]) end debug (3[678]|4<<you are here with [4]>>[1678]) end debug (3[678]|4[1678]<<you are here with [41 46 47 48]>>) end debug (3[678]|4[1678]<<you are here with [36 37 38 41 46 47 48]>>) end + debug (3[678]|4[1678])<<you are here with [36 37 38 41 46 47 48]>> end + 36,37,38,41,46,47,48 5[45] => debug <<you are here with []>>5[45] end debug 5<<you are here with [5]>>[45] end debug 5[45]<<you are here with [54 55]>> end 54,55 (6[4569]|7[01]) => debug <<you are here with []>>(6[4569]|7[01]) end debug (<<you are here with []>>6[4569]|7[01]) end debug (6<<you are here with [6]>>[4569]|7[01]) end debug (6[4569]<<you are here with [64 65 66 69]>>|7[01]) end debug (6[4569]|<<you are here with []>>7[01]) end debug (6[4569]|7<<you are here with [7]>>[01]) end debug (6[4569]|7[01]<<you are here with [70 71]>>) end debug (6[4569]|7[01]<<you are here with [64 65 66 69 70 71]>>) end debug (6[4569]|7[01])<<you are here with [64 65 66 69 70 71]>> end 64 +,65,66,69,70,71

      I hope this is helpful. At each debug output you can see what the /\G.../gc has stepped over and what the new value of $answer is.

      My problem with explaining this type of parser is that I have been working with parsers like this for well over a year, and it all comes as second nature to me. If you have more specific questions I'll be willing to take a shot at answering them.

      Hi Discipulus

      tybalt89 is combining the \G meta with //gc modifiers

      • \G achors the search at the pos where the last regex left off
      • unless you used the //c modifier, the pos of a previous regex was erased when it didn't match
      This allows to split a big complicated regex into multiple perl statements.

      And after combining with recursive calls this permits to parse nested structures.

      See perldocs for examples and details of this technique (e.g. perlretut ) or Friedl's Regex Book.

      Or even better older threads in the monastery :)

      You shouldn't be despaired, the demonstrated code is neither easy to read nor maintain.

      It should better use //x modifiers and more comments, but obfuscated wizardry is of course cooler! ;-))

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        Could be worse :)

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap # this is clear, isn't it? { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; while( /\G ( \| (?{ 'ALTERNATION' }) | \d+ (?{ 'STRING' }) | \[ \d+ \] (?{ 'CHARACTERCLASS' }) | \( (?{ 'PARENTHESIS' }) ) /gcx ) { goto $^R; ALTERNATION: $answer = [ @$answer, @{ expr() } ]; next; STRING: $answer = crossmap $answer, [ $1 ]; next; CHARACTERCLASS: $answer = crossmap $answer, [ $1 =~ /\d/g ]; next; PARENTHESIS: $answer = crossmap $answer, expr(); /\G\)/gc or err "missing ')'"; } return $answer; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])
        You shouldn't be despaired, the demonstrated code is neither easy to read nor maintain.

        Depends on who's reading it, and who's maintaining it. If so.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'