
Dynamic regex assertions, capturing groups, and parsers: joy and terror

by japhy (Canon)
on Oct 03, 2005 at 14:59 UTC (#496948, perlmeditation)

I'm pushing the limits of Perl's regexes, and I've come across an ugliness. I'm trying to write a simple parser that produces a tree structure that represents the data being parsed. (Specifically, parsing eBay search strings into a logic tree.) It appears that the "postponed regular expression" assertion, (??{ CODE }), does not play well with capturing groups. Observe:
    # prints 'j'
    "japhy" =~ m{ (.) (?{ print $1 }) }x;

    # prints nothing (undef, specifically)
    $rx = qr{ (.) (??{ print $1 }) }x;
    "japhy" =~ m{ (??{ $rx }) }x;
I know it's "experimental", but if this doesn't work now, it probably hasn't worked ever, which means nothing's been done about it, and I'm sure it's been reported as a bug before. The work-around I'm employing is shown in my code below. The code I'm showing is a proof-of-concept that $^R can be used in conjunction with (??{ ... }), although I'm sure I'm not the first person to attempt this.
    use Data::Dumper;
    $Data::Dumper::Indent = 1;
    use strict;

    sub ebay_search_logic {
      my $str = shift;
      my ($word, $neg, $alt);

      $word = qr{ (?{ save_pos() }) (\w+) (?{ push_word() }) }x;
      $neg = qr{ - (??{ $word }) (?{ mod_neg() }) }x;
      $alt = qr{
        \( (??{ $word }) (?{ alt1(); })
        (?: , (??{ $word }) (?{ alt2() }) )+ \)
      }x;

      return $str =~ m{
        (?{ [] })
        ^ \s*
        (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) )
        (?: \s+ (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) ) )*
        \s* $
        (?{ print Dumper($^R); $^R; })
      }x;
      return $str;
    }

    print ebay_search_logic("this that those"), "\n";
    # LIKE 'this' AND LIKE 'that' AND LIKE 'those'
    print ebay_search_logic("this -that those"), "\n";
    # LIKE 'this' AND (NOT LIKE 'that') AND LIKE 'those'
    print ebay_search_logic("this (that,those)"), "\n";
    # LIKE 'this' AND (LIKE 'that' OR LIKE 'those')

    sub save_pos {
      my @r = @{ $^R };
      [ @r, $+[0] ];
    }

    sub push_word {
      my @r = @{ $^R };
      my $p = pop @r;
      my $w = substr($_, $p, $+[0] - $p);
      [ @r, { WORD => $w } ];
    }

    sub mod_neg {
      my @r = @{ $^R };
      my $w = pop @r;
      [ @r, { NOT => $w->{WORD} } ];
    }

    sub alt1 {
      my @r = @{ $^R };
      my $w = pop @r;
      [ @r, { ALT => [ $w->{WORD} ] } ];
    }

    sub alt2 {
      my @r = @{ $^R };
      my $w = pop @r;
      my $alt = pop @r;
      [ @r, { ALT => [ @{ $alt->{ALT} }, $w->{WORD} ] } ];
    }
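
For readers who want the $^R idea in isolation, here is a minimal sketch that leaves out (??{ ... }) entirely. It assumes only what perlre documents about $^R (it holds the result of the most recently executed (?{ ... }) block). The variable name $tokens is made up for this illustration and does not appear in the code above. Comments are kept outside the pattern on purpose, because sigils inside /x comments can still be interpolated.

```perl
use strict;
use warnings;

# Seed $^R with an empty list, append each matched word to a fresh copy
# of that list, and copy the final list out once the whole string matched.
my $tokens;
"this that those" =~ m{
    (?{ [] })
    (?: \s* (\w+) (?{ [ @{ $^R }, $1 ] }) )+
    \z
    (?{ $tokens = $^R })
}x;

print join('|', @$tokens), "\n";
```

Each block returns a new array reference rather than modifying the old one, which is the same convention the helper subs above follow, so the engine can fall back to an earlier $^R value if it backtracks.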

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by blokhead (Monsignor) on Oct 03, 2005 at 17:07 UTC
    At first I thought it might have to do with the bizarre static/dynamic scoping duality of the $1, $2, ... variables, since the postponed (??{CODE}) blocks may be compiled somewhere where they can't reach the correct $1, $2, .... But they also can't seem to access @- and @+, which I don't think have the same scoping properties.

    And now I don't know what to think, because of the following example: I add an empty capturing group before the (??{CODE}) block in the outermost match, and it works (or at least seems to):

        # This is perl, v5.8.6 built for i386-linux-thread-multi
        use Data::Dumper;

        $rx = qr{ (.) (?{ print Dumper $1 }) }x;
        "japhy" =~ m{ (??{ $rx }) }x;
        ## $VAR1 = undef;

        $rx = qr{ (.) (?{ print Dumper $1 }) }x;
        "japhy" =~ m{ () (??{ $rx }) }x;
        ## $VAR1 = 'j';

    If you were to dump @- and @+ from inside the first example, you'd see that @- has two entries, but @+ has one. It's as if $1 was only partially "set up".

    Now I don't know if this "workaround" helps you or not. It could probably allow you to write the parser closer to how you originally envisioned. But it seems like a more fragile workaround than the one you have, and I'm not sure how much I trust it. I think I remember seeing similar weirdness with capturing parens somewhere else, but I can't find the reference at the moment.

    Then there's the fact that doing this with a regex is silly, when you could write a RecDescent grammar in about 5 seconds.. but I know you know that ;)


Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by itub (Priest) on Oct 03, 2005 at 19:03 UTC
    I once wrote a parser as a huge regex, full of "experimental" features.

    I'll never do it again.

    I started getting all sorts of weird errors such as segfaults, and they varied wildly between perl versions. I could never find a simple example to reproduce the crash, so I couldn't even file a proper bug report. Also, the huge regex was unmaintainable. Now I'd rather use a parser generator. I'm happy with Parse::YAPP. It's not as trendy as Parse::RecDescent, but it's way faster in my experience.

      It might help to remember that you can't run anything using the regex engine while you're inside a (?{...}) or (??{...}) block. You'll usually get segfaults and such if you do that. The engine isn't re-entrant, and if you invoke a regex during a regex, you scribble on memory. It's supposed to have gotten better during 5.8, but I haven't tried it again.

        Thanks, that might be what was happening! I was calling subroutines from the ?{} blocks and it is very likely that some of them used regexes internally.

      A much better alternative than using lots of experimental features to write a single-regex parser is to split the matching across lots of /gc regexes. The resulting code is much easier to follow too, and you don’t need contortions to keep a grip on backtracking (my kingdom for Perl6’s commit!).
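
A rough sketch of what the /gc approach might look like for the search strings in the original post. The sub name parse_search and the choice to return undef on bad input are assumptions made for this illustration. The output shape (WORD, NOT, ALT hashes) mirrors the tree the original code builds.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string with \G and /gc, one small regex
# per token kind. Returns an array ref of terms, or undef on bad input.
sub parse_search {
    my ($str) = @_;
    my @terms;

    pos($str) = 0;
    $str =~ /\G\s+/gc;                        # optional leading whitespace
    while (pos($str) < length($str)) {
        if ($str =~ /\G-(\w+)/gc) {           # -word
            push @terms, { NOT => $1 };
        }
        elsif ($str =~ /\G\((\w+)/gc) {       # (word,word,...)
            my @alts = ($1);
            push @alts, $1 while $str =~ /\G,(\w+)/gc;
            # need at least two alternatives and a closing paren
            return unless @alts > 1 && $str =~ /\G\)/gc;
            push @terms, { ALT => \@alts };
        }
        elsif ($str =~ /\G(\w+)/gc) {         # plain word
            push @terms, { WORD => $1 };
        }
        else {
            return;                           # unexpected character
        }
        last if pos($str) == length($str);
        return unless $str =~ /\G\s+/gc;      # terms are whitespace-separated
    }
    return @terms ? \@terms : undef;
}

my $tree = parse_search("this -that (those,these)");
print scalar(@$tree), " terms parsed\n";
```

Because /c keeps pos() in place when a match fails, each branch can simply try the next token kind from the same position. No experimental regex features are involved.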

      Makeshifts last the longest.

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by demerphq (Chancellor) on Oct 03, 2005 at 22:32 UTC

    I find this quite confusing as well.

    Update: No this makes perfect sense. Thanks to dio for straightening me out.

        $rx = qr{ (.) (??{ print $1 }) }x;
        print "!" if "japhy" =~ $rx;
        __END__
        japhy

    How does $1 end up being 'japhy' with this re? Interestingly, changing it to

        $rx = qr{ (.) (??{ print $1; '' }) }x;
        print "!" if "japhy" =~ $rx;
        __END__
        j!

    makes things work out properly.


      The result of (??{ print $1 }) is 1 because print() succeeded in writing to STDOUT. The regex that was then compiled by (??{ ...}) was "1" which then failed. So the (.) advanced over every character and printed them individually. The proper thing to do here would have been (?{ ... }) which will not affect regex matching.
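
The difference can be seen without relying on printed output. In this sketch, note() is a made-up stand-in for print: it records the character and returns 1, just as print does on success. The array names are likewise invented for the illustration.

```perl
use strict;
use warnings;

my (@postponed, @plain);

# Hypothetical stand-in for print: remember the character, return 1.
sub note {
    my ($list, $char) = @_;
    push @$list, $char;
    return 1;
}

# (??{ ... }): the return value 1 becomes the pattern "1", which never
# matches in "japhy", so the engine retries at every start position.
my $m1 = "japhy" =~ m{ (.) (??{ note(\@postponed, $1) }) }x;

# (?{ ... }): the return value is not used as a pattern, so the match
# succeeds at the first character.
my $m2 = "japhy" =~ m{ (.) (?{ note(\@plain, $1) }) }x;

printf "postponed: %s (matched: %s)\n", join('', @postponed), $m1 ? 'yes' : 'no';
printf "plain:     %s (matched: %s)\n", join('', @plain),     $m2 ? 'yes' : 'no';
```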

        Doh. Of course. I knew that the print returning 1 failed the match, but I didn't put two and two together to realize that was why all of the chars were printed. And I've used this technique deliberately before too. /gah.

        Thanks for the clue-by-four. :-)


Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by QM (Parson) on Oct 03, 2005 at 21:49 UTC
    I expect you want to gripe about the behavior of (??{ CODE }), so this may be out of hand...

    Would Parse::RecDescent be useful for parsing into a tree structure for you?

    I ask because I saw an interesting talk on this last week at the Toronto Perl Mongers meeting, and thought I should give it a shot for my next parsing project.

    Quantum Mechanics: The dreams stuff is made of
