Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Regular Expression Test

by leoberbert (Novice)
on May 05, 2017 at 19:58 UTC ( #1189605=perlquestion: print w/replies, xml ) Need Help??

leoberbert has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I would like to know if there is any module or a form so that I can test all the possibility of a regular expression according to the example below: My file contains a column with all regular expressions.
(65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])
Expected output.
(65|70) => 65,70 (3[678]|4[1678]) => 36,37,38,41,46,47,48 5[45] => 54,55 (6[4569]|7[01]) => 64,65,66,69,70,71
Regards

Replies are listed 'Best First'.
Re: Regular Expression Test
by AnomalousMonk (Bishop) on May 05, 2017 at 22:06 UTC

    As Laurent_R noted, the general form of this problem is potentially very difficult. The usual approach is to restrict as much as possible the regex operators to be supported. For an interesting (and possibly even helpful!) discussion of the problem, see Higher Order Perl, (free download here), section 6.5 "Regex String Generation".


    Give a man a fish:  <%-{-{-{-<

Re: Regular Expression Test
by tybalt89 (Prior) on May 06, 2017 at 00:31 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; $answer = /\G(\d+)/gc ? crossmap $answer, [ $1 ] : /\G\|/gc ? [ @$answer, @{ expr() } ] : /\G\[(\d+)\]/gc ? crossmap $answer, [ split //, $1 ] : /\G\(/gc ? ( crossmap($answer, expr()), /\G\)/gc || err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])

    Hey, it works for all your test cases :)

    UPDATE: typo fixed

      tybalt89++

      I'd like to understand it.. ;=)

      Is this a lex parser as noted in perlre ?

      It just support the test cases? or can be extended to support more regex operators?

      I was trying a raw approach but when i saw your i gave up.

      Brilliant regex hacker!

      UPDATE: my attempt with the simplicistic data sample proposed in OP (show representative data as corollary to Know your data principle), was something like:

      use strict; use warnings; while(<DATA>){ chomp; my $rx; eval{ $rx = qr($_) }; if($@){print "-->$_<-- INVALID: ",lc $@=~s/$0 line \d+,//r; next +} print "$rx\t=>\t"; s/[\(\)]//g; foreach my $part (split /\|/){ if ($part=~/(\d+)?\[\d+\]/){ print join ' ', grep{/$rx/}$1.0 .. $1.9; print ' '; } else{print "$part "} } print "\n"; } __DATA__ (65|70) (3999[678]|4[1678]) 5[45] (6[4569]|7[01]) (((broken (65|70) # out (?^:(65|70)) => 65 70 (?^:(3999[678]|4[1678])) => 39996 39997 39998 41 46 47 48 (?^:5[45]) => 54 55 (?^:(6[4569]|7[01])) => 64 65 66 69 70 71 -->(((broken<-- INVALID: unmatched ( in regex; marked by <-- here in m +/((( <-- here broken/ at <data> line 5. (?^:(65|70)) => 65 70

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        This is not just a "lex"-like scanner as noted in perlre, but a fully recursive parser (note the calls to expr() inside some of the actions). It will properly handle multiple levels of parentheses.

        The advice I would give is to instrument it up with debug prints to see what's happening, also adding a recursion level counter to track recursive calls.

        It is loosely based on this expression parser I wrote a while ago:

        #!/usr/bin/perl use strict; # mini.pl - modified Pratt parser by tybalt89 use warnings; # https://en.wikipedia.org/wiki/Pratt_parser sub error { die "ERROR ", s/\G/ <@_> /r, "\n" } sub expr # two statement parser - precedences: (3 **) (2 * /) (1 + -) { my $answer = /\G\s* ((?:\d+(?:\.\d*)?|\.\d+)(e[+-]?\d+)?) /gcxi ? $1 : /\G\s*\(/gc ? (expr(0), /\G\s*\)/gc || error 'missing )')[0] : /\G\s* - /gcx ? -expr(3) : # unary minus /\G\s* \+ /gcx ? +expr(3) : # unary plus error 'bad operand'; $answer = $_[0] <= 3 && /\G\s* \*\* /gcx ? $answer ** expr(3) : $_[0] <= 2 && /\G\s* \* /gcx ? $answer * expr(3) : $_[0] <= 2 && /\G\s* \/ /gcx ? $answer / expr(3) : $_[0] <= 1 && /\G\s* \+ /gcx ? $answer + expr(2) : $_[0] <= 1 && /\G\s* \- /gcx ? $answer - expr(2) : return $answer while 1; } for ( @ARGV ? @ARGV : scalar <> ) # source as commandline args or stdi +n { my $answer = expr(0); /\G\s*\z/gc ? print s/\s*\z/ = $answer\n/r : error 'incomplete parse +'; }

        which was also the basis for this Re: Parsing Boolean expressions

        but it doesn't need the precedence stuff.

        Here's the same code with one debug print added to show what's going on at the beginning of each time around the "while 1" loop.

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; print("\ndebug ", s/\G/<<you are here with [@$answer]>>/r, " end " +), $answer = /\G\|/gc ? [ @$answer, @{ expr() } ] : /\G\d+/gc ? crossmap $answer, [ $& ] : /\G\[\d+\]/gc ? crossmap $answer, [ $& =~ /\d/g ] : /\G\(/gc ? ( crossmap($answer, expr()), /\G\)/gc || err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65|70) (3[678]|4[1678]) 5[45] (6[4569]|7[01])

        It prints the following ( showing both where the current position in the input is (the \G) and the contents of the variable $answer at the time).:

        (65|70) => debug <<you are here with []>>(65|70) end debug (<<you are here with []>>65|70) end debug (65<<you are here with [65]>>|70) end debug (65|<<you are here with []>>70) end debug (65|70<<you are here with [70]>>) end debug (65|70<<you are here with [65 70]>>) end debug (65|70)<<you are here with [65 70]>> end 65,70 (3[678]|4[1678]) => debug <<you are here with []>>(3[678]|4[1678]) end debug (<<you are here with []>>3[678]|4[1678]) end debug (3<<you are here with [3]>>[678]|4[1678]) end debug (3[678]<<you are here with [36 37 38]>>|4[1678]) end debug (3[678]|<<you are here with []>>4[1678]) end debug (3[678]|4<<you are here with [4]>>[1678]) end debug (3[678]|4[1678]<<you are here with [41 46 47 48]>>) end debug (3[678]|4[1678]<<you are here with [36 37 38 41 46 47 48]>>) end + debug (3[678]|4[1678])<<you are here with [36 37 38 41 46 47 48]>> end + 36,37,38,41,46,47,48 5[45] => debug <<you are here with []>>5[45] end debug 5<<you are here with [5]>>[45] end debug 5[45]<<you are here with [54 55]>> end 54,55 (6[4569]|7[01]) => debug <<you are here with []>>(6[4569]|7[01]) end debug (<<you are here with []>>6[4569]|7[01]) end debug (6<<you are here with [6]>>[4569]|7[01]) end debug (6[4569]<<you are here with [64 65 66 69]>>|7[01]) end debug (6[4569]|<<you are here with []>>7[01]) end debug (6[4569]|7<<you are here with [7]>>[01]) end debug (6[4569]|7[01]<<you are here with [70 71]>>) end debug (6[4569]|7[01]<<you are here with [64 65 66 69 70 71]>>) end debug (6[4569]|7[01])<<you are here with [64 65 66 69 70 71]>> end 64 +,65,66,69,70,71

        I hope this is helpful. At each debug output you can see what the /\G.../gc has stepped over and what the new value of $answer is.

        My problem with explaining this type of parser is that I have been working with parsers like this for well over a year, and it all comes as second nature to me. If you have more specific questions I'll be willing to take a shot at answering them.

        Hi Discipulus

        tybalt89 is combining the \G meta with //gc modifiers

        • \G achors the search at the pos where the last regex left off
        • unless you used the //c modifier, the pos of a previous regex was erased when it didn't match
        This allows to split a big complicated regex into multiple perl statements.

        And after combining with recursive calls this permits to parse nested structures.

        See perldocs for examples and details of this technique (e.g. perlretut ) or Friedl's Regex Book.

        Or even better older threads in the monastery :)

        You shouldn't be despaired, the demonstrated code is neither easy to read nor maintain.

        It should better use //x modifiers and more comments, but obfuscated wizardry is of course cooler! ;-))

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Regular Expression Test
by Anonymous Monk on May 05, 2017 at 20:19 UTC
      No, they are all possibilities for regular expression.
        Then I am afraid it probably can't be done.

        All the regex examples you've shown are deterministic, so to speak, but how would you expand, for example, a quantifier such as * or +? a+ could be anything like "a", "aa", "aaa", and so on. How would you express anchors? Zero-width assertions? Greedy and non-greedy quantifiers? And so on.

        Or you would have to explain how you want them to be translated. Or perhaps I missed what you're looking for?

Re: Regular Expression Test (XY Problem)
by LanX (Sage) on May 06, 2017 at 13:53 UTC
    Why?

    This looks a lot like a XY Problem , please specify your intention to get the best answer.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1189605]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (8)
As of 2022-05-16 09:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (62 votes). Check out past polls.

    Notices?