Regular Expression Test

leoberbert has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular Expression Test by AnomalousMonk (Archbishop) on May 05, 2017 at 22:06 UTC
As Laurent_R noted, the general form of this problem is potentially very difficult. The usual approach is to restrict as much as possible the regex operators to be supported. For an interesting (and possibly even helpful!) discussion of the problem, see Higher Order Perl, (free download here), section 6.5 "Regex String Generation". Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re: Regular Expression Test by tybalt89 (Monsignor) on May 06, 2017 at 00:31 UTC
#!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $\| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; $answer = /\G(\d+)/gc ? crossmap $answer, [ $1 ] : /\G\\|/gc ? [ @$answer, @{ expr() } ] : /\G\[(\d+)\]/gc ? crossmap $answer, [ split //, $1 ] : /\G$/gc ? ( crossmap($answer, expr()), /\G$/gc \|\| err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65\|70) (3[678]\|4[1678]) 5[45] (6[4569]\|7[01]) [download] Hey, it works for all your test cases :) UPDATE: typo fixed	[reply] [d/l]
Re^2: Regular Expression Test by Discipulus (Canon) on May 06, 2017 at 12:44 UTC
tybalt89++ I'd like to understand it.. ;=) Is this a lex parser as noted in `perlre` ? It just support the test cases? or can be extended to support more regex operators? I was trying a raw approach but when i saw your i gave up. Brilliant regex hacker! UPDATE: my attempt with the simplicistic data sample proposed in OP (show representative data as corollary to Know your data principle), was something like: use strict; use warnings; while(<DATA>){ chomp; my $rx; eval{ $rx = qr($_) }; if($@){print "-->$_<-- INVALID: ",lc $@=~s/$0 line \d+,//r; next +} print "$rx\t=>\t"; s/[]//g; foreach my $part (split /\\|/){ if ($part=~/(\d+)?\[\d+\]/){ print join ' ', grep{/$rx/}$1.0 .. $1.9; print ' '; } else{print "$part "} } print "\n"; } __DATA__ (65\|70) (3999[678]\|4[1678]) 5[45] (6[4569]\|7[01]) (((broken (65\|70) # out (?^:(65\|70)) => 65 70 (?^:(3999[678]\|4[1678])) => 39996 39997 39998 41 46 47 48 (?^:5[45]) => 54 55 (?^:(6[4569]\|7[01])) => 64 65 66 69 70 71 -->(((broken<-- INVALID: unmatched ( in regex; marked by <-- here in m +/((( <-- here broken/ at <data> line 5. (?^:(65\|70)) => 65 70 [download] L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.	[reply] [d/l]
Re^3: Regular Expression Test by tybalt89 (Monsignor) on May 06, 2017 at 18:44 UTC
This is not just a "lex"-like scanner as noted in perlre, but a fully recursive parser (note the calls to expr() inside some of the actions). It will properly handle multiple levels of parentheses. The advice I would give is to instrument it up with debug prints to see what's happening, also adding a recursion level counter to track recursive calls. It is loosely based on this expression parser I wrote a while ago: #!/usr/bin/perl use strict; # mini.pl - modified Pratt parser by tybalt89 use warnings; # https://en.wikipedia.org/wiki/Pratt_parser sub error { die "ERROR ", s/\G/ <@_> /r, "\n" } sub expr # two statement parser - precedences: (3 *) (2 /) (1 + -) { my $answer = /\G\s* ((?:\d+(?:\.\d)?\|\.\d+)(e[+-]?\d+)?) /gcxi ? $1 : /\G\s$/gc ? (expr(0), /\G\s$/gc \|\| error 'missing )')[0] : /\G\s - /gcx ? -expr(3) : # unary minus /\G\s* \+ /gcx ? +expr(3) : # unary plus error 'bad operand'; $answer = $_[0] <= 3 && /\G\s* \\ /gcx ? $answer ** expr(3) : $_[0] <= 2 && /\G\s* \* /gcx ? $answer * expr(3) : $_[0] <= 2 && /\G\s* \/ /gcx ? $answer / expr(3) : $_[0] <= 1 && /\G\s* \+ /gcx ? $answer + expr(2) : $_[0] <= 1 && /\G\s* \- /gcx ? $answer - expr(2) : return $answer while 1; } for ( @ARGV ? @ARGV : scalar <> ) # source as commandline args or stdi +n { my $answer = expr(0); /\G\s\z/gc ? print s/\s\z/ = $answer\n/r : error 'incomplete parse +'; } [download] which was also the basis for this Re: Parsing Boolean expressions but it doesn't need the precedence stuff.	[reply] [d/l]
Re^4: Regular Expression Test by RonW (Parson) on May 10, 2017 at 22:01 UTC
Re^3: Regular Expression Test by tybalt89 (Monsignor) on May 06, 2017 at 22:31 UTC
Here's the same code with one debug print added to show what's going on at the beginning of each time around the "while 1" loop. #!/usr/bin/perl # http://perlmonks.org/?node_id=1189605 use strict; use warnings; $\| = 1; sub err { die "ERROR: ", s/\G/<@_>/r, "\n" } sub crossmap { my ($left, $right) = @_; [ map { my $prefix = $_ ; map $prefix.$_, @$right } @$left ]; } sub expr { my $answer = [ '' ]; print("\ndebug ", s/\G/<<you are here with [@$answer]>>/r, " end " +), $answer = /\G\\|/gc ? [ @$answer, @{ expr() } ] : /\G\d+/gc ? crossmap $answer, [ $& ] : /\G\[\d+\]/gc ? crossmap $answer, [ $& =~ /\d/g ] : /\G$/gc ? ( crossmap($answer, expr()), /\G$/gc \|\| err "missing ')'")[0] : return $answer while 1; } while(<DATA>) { chomp; print "$_ => "; my $answer = expr(); /\G\z/gc or err "incomplete parse"; local $" = ','; print "@$answer\n"; } __DATA__ (65\|70) (3[678]\|4[1678]) 5[45] (6[4569]\|7[01]) [download] It prints the following ( showing both where the current position in the input is (the \G) and the contents of the variable $answer at the time).: (65\|70) => debug <<you are here with []>>(65\|70) end debug (<<you are here with []>>65\|70) end debug (65<<you are here with [65]>>\|70) end debug (65\|<<you are here with []>>70) end debug (65\|70<<you are here with [70]>>) end debug (65\|70<<you are here with [65 70]>>) end debug (65\|70)<<you are here with [65 70]>> end 65,70 (3[678]\|4[1678]) => debug <<you are here with []>>(3[678]\|4[1678]) end debug (<<you are here with []>>3[678]\|4[1678]) end debug (3<<you are here with [3]>>[678]\|4[1678]) end debug (3[678]<<you are here with [36 37 38]>>\|4[1678]) end debug (3[678]\|<<you are here with []>>4[1678]) end debug (3[678]\|4<<you are here with [4]>>[1678]) end debug (3[678]\|4[1678]<<you are here with [41 46 47 48]>>) end debug (3[678]\|4[1678]<<you are here with [36 37 38 41 46 47 48]>>) end + debug (3[678]\|4[1678])<<you are here with [36 37 38 41 46 47 48]>> end + 36,37,38,41,46,47,48 5[45] => debug <<you are here with []>>5[45] end debug 5<<you are here with [5]>>[45] end debug 5[45]<<you are here with [54 55]>> end 54,55 (6[4569]\|7[01]) => debug <<you are here with []>>(6[4569]\|7[01]) end debug (<<you are here with []>>6[4569]\|7[01]) end debug (6<<you are here with [6]>>[4569]\|7[01]) end debug (6[4569]<<you are here with [64 65 66 69]>>\|7[01]) end debug (6[4569]\|<<you are here with []>>7[01]) end debug (6[4569]\|7<<you are here with [7]>>[01]) end debug (6[4569]\|7[01]<<you are here with [70 71]>>) end debug (6[4569]\|7[01]<<you are here with [64 65 66 69 70 71]>>) end debug (6[4569]\|7[01])<<you are here with [64 65 66 69 70 71]>> end 64 +,65,66,69,70,71 [download] I hope this is helpful. At each debug output you can see what the /\G.../gc has stepped over and what the new value of $answer is. My problem with explaining this type of parser is that I have been working with parsers like this for well over a year, and it all comes as second nature to me. If you have more specific questions I'll be willing to take a shot at answering them.	[reply] [d/l] [select]
Re^3: Regular Expression Test by LanX (Saint) on May 06, 2017 at 19:46 UTC
Hi Discipulus tybalt89 is combining the \G meta with //gc modifiers \G achors the search at the `pos` where the last regex left off unless you used the //c modifier, the pos of a previous regex was erased when it didn't match This allows to split a big complicated regex into multiple perl statements. And after combining with recursive calls this permits to parse nested structures. See perldocs for examples and details of this technique (e.g. `perlretut` ) or Friedl's Regex Book. Or even better older threads in the monastery :) You shouldn't be despaired, the demonstrated code is neither easy to read nor maintain. It should better use //x modifiers and more comments, but obfuscated wizardry is of course cooler! ;-)) Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^4: Regular Expression Test by tybalt89 (Monsignor) on May 07, 2017 at 19:04 UTC
Re^5: Regular Expression Test by LanX (Saint) on May 07, 2017 at 19:26 UTC
Re^4: Regular Expression Test by shmem (Chancellor) on May 06, 2017 at 20:30 UTC
Re^5: Regular Expression Test by LanX (Saint) on May 06, 2017 at 21:10 UTC
Re: Regular Expression Test by Anonymous Monk on May 05, 2017 at 20:19 UTC
Are `()`, `\|` and `[]` the only regexp operators you need to support? See generating strings from a regular expression, Regexp generating strings?, and Regexp::Genex.	[reply] [d/l] [select]
Re^2: Regular Expression Test by leoberbert (Novice) on May 05, 2017 at 20:35 UTC
No, they are all possibilities for regular expression.	[reply]
Re^3: Regular Expression Test by Laurent_R (Canon) on May 05, 2017 at 21:19 UTC
Then I am afraid it probably can't be done. All the regex examples you've shown are deterministic, so to speak, but how would you expand, for example, a quantifier such as * or +? `a+` could be anything like "a", "aa", "aaa", and so on. How would you express anchors? Zero-width assertions? Greedy and non-greedy quantifiers? And so on. Or you would have to explain how you want them to be translated. Or perhaps I missed what you're looking for?	[reply] [d/l]
Re: Regular Expression Test (XY Problem) by LanX (Saint) on May 06, 2017 at 13:53 UTC
Why? This looks a lot like a XY Problem , please specify your intention to get the best answer. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]


Perl: the Markov chain saw
	PerlMonks