Dynamic regex assertions, capturing groups, and parsers: joy and terror

I'm pushing the limits of Perl's regexes, and I've come across an ugliness. I'm trying to write a simple parser that produces a tree structure that represents the data being parsed. (Specifically, parsing eBay search strings into a logic tree.) It appears that the "postponed regular expression" assertion, (??{ CODE }), does not play well with capturing groups. Observe:

# prints 'j'
"japhy" =~ m{ (.) (?{ print $1 }) }x;

# prints nothing (undef, specifically)
$rx = qr{ (.) (??{ print $1 }) }x;
"japhy" =~ m{ (??{ $rx }) }x;

I know it's "experimental", but if this doesn't work now, it probably hasn't worked ever, which means nothing's been done about it, and I'm sure it's been reported as a bug before. The work-around I'm employing is shown in my code below. The code I'm showing is a proof-of-concept that $^R can be used in conjunction with (??{ ... }), although I'm sure I'm not the first person to attempt this.

use Data::Dumper;
$Data::Dumper::Indent = 1;

use strict;

sub ebay_search_logic {
  my $str = shift;

  my ($word, $neg, $alt);
  $word = qr{ (?{ save_pos() }) (\w+) (?{ push_word() }) }x;
  $neg = qr{ - (??{ $word }) (?{ mod_neg() }) }x;
  $alt = qr{ \( (??{ $word }) (?{ alt1(); }) (?: , (??{ $word }) (?{ alt2() }) )+ \) }x;

  return $str =~ m{
    (?{ [] })
    ^ \s*
    (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) )
    (?: \s+ (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) ) )*
    \s* $
    (?{ print Dumper($^R); $^R; })
  }x;

  return $str;
}

print ebay_search_logic("this that those"), "\n";    # LIKE 'this' AND LIKE 'that' AND LIKE 'those'
print ebay_search_logic("this -that those"), "\n";   # LIKE 'this' AND (NOT LIKE 'that') AND LIKE 'those'
print ebay_search_logic("this (that,those)"), "\n";  # LIKE 'this' AND (LIKE 'that' OR LIKE 'those')

sub save_pos {
  my @r = @{ $^R };
  [ @r, $+[0] ];
}


sub push_word {
  my @r = @{ $^R };
  my $p = pop @r;
  my $w = substr($_, $p, $+[0] - $p);
  [ @r, { WORD => $w } ];
}


sub mod_neg {
  my @r = @{ $^R };
  my $w = pop @r;
  [ @r, { NOT => $w->{WORD} } ];
}


sub alt1 {
  my @r = @{ $^R };
  my $w = pop @r;
  [ @r, { ALT => [ $w->{WORD} ] } ];
}


sub alt2 {
  my @r = @{ $^R };
  my $w = pop @r;
  my $alt = pop @r;
  [ @r, { ALT => [ @{ $alt->{ALT} }, $w->{WORD} ] } ];
}
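
For readers who have not met $^R before, here is a minimal sketch of the threading trick the code above relies on, without the (??{ ... }) indirection. Each (?{ ... }) block returns a fresh array reference, which becomes the next value of $^R, and perl restores the previous value if it backtracks over the block. The variable $collected and the sample string are made up for illustration, and the sketch assumes a reasonably modern perl (5.18 or later), where code blocks are less fragile than they were in 5.8.

```perl
use strict;
use warnings;

our $collected;    # hypothetical package variable that receives the final $^R

my $ok = "alpha beta gamma" =~ m{
    (?{ [] })                                  # seed $^R with an empty list
    ^
    (?: (\w+) (?{ [ @{ $^R }, $1 ] }) \s* )+   # append each word to a copy of $^R
    \z
    (?{ $collected = $^R })                    # hand the result to the outside world
}x;

print join(", ", @$collected), "\n" if $ok;
```

Because every block builds a new array instead of modifying the old one in place, a backtracked branch leaves no trace in the final result.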

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror by blokhead (Monsignor) on Oct 03, 2005 at 17:07 UTC
At first I thought it might have to do with the bizarre static/dynamic scoping duality of the $1, $2, ... variables, since the postponed (??{CODE}) blocks may be compiled somewhere where they can't reach the correct $1, $2, .... But they also can't seem to access @- and @+, which I don't think have the same scoping properties.

And now I don't know what to think, because of the following example: I add an empty capturing group before the (??{CODE}) block in the outermost match, and it works (or at least seems to):

# This is perl, v5.8.6 built for i386-linux-thread-multi
use Data::Dumper;

$rx = qr{ (.) (?{ print Dumper $1 }) }x;
"japhy" =~ m{ (??{ $rx }) }x;
## $VAR1 = undef;

$rx = qr{ (.) (?{ print Dumper $1 }) }x;
"japhy" =~ m{ () (??{ $rx }) }x;
## $VAR1 = 'j';

If you were to dump @- and @+ from inside the first example, you'd see that @- has two entries, but @+ has one. It's as if $1 was only partially "set up".

Now I don't know if this "workaround" helps you or not. It could probably allow you to write the parser closer to how you originally envisioned. But it seems like a more fragile workaround than the one you have, and I'm not sure how much I trust it. I think I remember seeing similar weirdness with capturing parens somewhere else, but I can't find the reference at the moment.

Then there's the fact that doing this with a regex is silly, when you could write a RecDescent grammar in about 5 seconds.. but I know you know that ;)

blokhead
Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror by itub (Priest) on Oct 03, 2005 at 19:03 UTC
I once wrote a parser as a huge regex, full of "experimental" features. I'll never do it again. I started getting all sorts of weird errors such as segfaults, and they varied wildly between perl versions. I could never find a simple example to reproduce the crash, so I couldn't even file a proper bug report. Also, the huge regex was unmaintainable. Now I'd rather use a parser generator. I'm happy with Parse::YAPP. It's not as trendy as Parse::RecDescent, but it's way faster in my experience.
Re^2: Dynamic regex assertions, capturing groups, and parsers: joy and terror by diotalevi (Canon) on Oct 03, 2005 at 23:20 UTC
It might help to remember that you can't run anything using the regex engine while you're inside a (?{...}) or (??{...}) block. You'll usually get segfaults and such if you do that. The engine isn't re-entrant, and if you invoke a regex during a regex, you scribble on memory. It's supposed to have gotten better during 5.8, but I haven't tried it again.
Re^3: Dynamic regex assertions, capturing groups, and parsers: joy and terror by itub (Priest) on Oct 05, 2005 at 20:57 UTC
Thanks, that might be what was happening! I was calling subroutines from the ?{} blocks and it is very likely that some of them used regexes internally.
Re^2: Dynamic regex assertions, capturing groups, and parsers: joy and terror by Aristotle (Chancellor) on Oct 09, 2005 at 12:55 UTC
A much better alternative than using lots of experimental features to write a single-regex parser is to split the matching across lots of `/gc` regexes. The resulting code is much easier to follow too, and you don’t need contortions to keep a grip on backtracking (my kingdom for Perl6’s `commit`!).

Makeshifts last the longest.
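
To make the `/gc` suggestion concrete, here is a rough sketch of how the eBay search strings from the root node might be parsed that way. The sub name parse_search and the exact patterns are my own assumptions rather than anything from the original code; the node shapes (WORD, NOT, ALT) follow the root node. Each regex is anchored with \G at the current position, and /c keeps pos() in place when a match fails, so the next alternative can be tried from the same spot.

```perl
use strict;
use warnings;

# Hypothetical /gc-based parser; returns an array ref of nodes, or nothing on a syntax error.
sub parse_search {
    my ($str) = @_;
    my @tree;
    for ($str) {                      # alias $_ so the matches below share one pos()
        /\G\s+/gc;                    # skip optional leading whitespace
        while (1) {
            if (/\G-(\w+)/gc) {                       # -word
                push @tree, { NOT => $1 };
            }
            elsif (/\G\((\w+(?:,\w+)+)\)/gc) {        # (word,word,...)
                my $list = $1;
                push @tree, { ALT => [ split /,/, $list ] };
            }
            elsif (/\G(\w+)/gc) {                     # plain word
                push @tree, { WORD => $1 };
            }
            else {
                return;                               # nothing recognisable here
            }
            last if /\G\s*\z/gc;      # only trailing whitespace left: done
            return unless /\G\s+/gc;  # terms must be separated by whitespace
        }
    }
    return \@tree;
}

my $tree = parse_search("this -that (those,these)");
print scalar(@$tree), " terms parsed\n" if $tree;
```

No code blocks, no $^R, and backtracking never enters into it, since each step either consumes a token or gives up.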
Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror by demerphq (Chancellor) on Oct 03, 2005 at 22:32 UTC
I find this quite confusing as well.

Update: No this makes perfect sense. Thanks to dio for straightening me out.

$rx = qr{ (.) (??{ print $1 }) }x;
print "!" if "japhy" =~ $rx;
__END__
japhy

How does $1 end up being 'japhy' with this re? Interestingly, changing it to

$rx = qr{ (.) (??{ print $1; '' }) }x;
print "!" if "japhy" =~ $rx;
__END__
j!

makes things work out properly.

---
$world=~s/war/peace/g
Re^2: Dynamic regex assertions, capturing groups, and parsers: joy and terror by diotalevi (Canon) on Oct 03, 2005 at 23:09 UTC
The result of (??{ print $1 }) is 1 because print() succeeded in writing to STDOUT. The regex that was then compiled by (??{ ...}) was "1" which then failed. So the (.) advanced over every character and printed them individually. The proper thing to do here would have been (?{ ... }) which will not affect regex matching.
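
A small sketch of the same effect without print(), in case it helps to see it side by side. The array and variable names are invented for the example. The first block returns 1 explicitly, standing in for print()'s return value; the second returns an empty string.

```perl
use strict;
use warnings;

our (@seen_one, @seen_empty);    # hypothetical names, for illustration only

# The block returns 1, so the engine next tries to match the pattern "1".
# That fails at every starting position, so the block runs once per character.
my $matched_one = "japhy" =~ m{ (.) (??{ push @seen_one, $1; 1 }) }x;

# Returning '' yields an empty pattern, which matches immediately.
my $matched_empty = "japhy" =~ m{ (.) (??{ push @seen_empty, $1; '' }) }x;

printf "returns 1:  saw '%s', matched: %s\n",
    join('', @seen_one), $matched_one ? 'yes' : 'no';
printf "returns '': saw '%s', matched: %s\n",
    join('', @seen_empty), $matched_empty ? 'yes' : 'no';
```

With (?{ ... }) in place of (??{ ... }) the return value would only land in $^R and would not be matched against the string at all.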
Re^3: Dynamic regex assertions, capturing groups, and parsers: joy and terror by demerphq (Chancellor) on Oct 04, 2005 at 09:13 UTC
Doh. Of course. I knew that the print returning 1 failed the match, but I didn't put two and two together to realize that was why all of the chars were printed. And I've used this technique deliberately before too. /gah. Thanks for the clue-by-four. :-)

---
$world=~s/war/peace/g
Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror by QM (Parson) on Oct 03, 2005 at 21:49 UTC
I expect you want to gripe about the behavior of `(??{ CODE })`, so this may be out of hand... Would Parse::RecDescent be useful for parsing into a tree structure for you? I ask because I saw an interesting talk on this last week at the Toronto Perl Mongers meeting, and thought I should give it a shot for my next parsing project.

-QM
--
Quantum Mechanics: The dreams stuff is made of

