bbfu has asked for the wisdom of the Perl Monks concerning the following question:

Okay, this one has got me totally baffled. Basically, I have a regular expression that is slurping up more data than it rightly should. I really can't see anything I'm doing wrong and the results are totally unexpected, though the problem seems to be related to having \\? in the RE. It's hard to describe so I'll just go straight into the code...

Using this data:

$_ = 'foo\$bar';

This tokenizing code snippet:

{ if(m/\G(\\?\$\w+)/gc or m/\G(\w+)/gc) { print "token: '$1'\n"; redo; } }

Produces the following output:

token: '\$bar'

That is, it immediately matches '\$bar' and 'foo' gets skipped over (though it should be matched first by the second RE). If the \\? is changed to \\ in the first RE, or if the input is changed to 'foo$bar', then the 'foo' is matched correctly by the second RE. Unfortunately, neither of those is the logic I wanted.

Also, it doesn't seem to be the exact \\?, since it still matches incorrectly if I change it to (?:\\)?.

I really hope someone can give me a hint as to what is causing this strange behavior. TIA

(This is perl, version 5.005_03 built for i386-linux)

bbfu
Seasons don't fear The Reaper.
Nor do the wind, the sun, and the rain.
We can be like they are.

Re: RE prollem: \G, \\? and disappearing data
by MeowChow (Vicar) on Feb 12, 2001 at 14:28 UTC
    This code actually does work as you would like under 5.6.0 (ActivePerl, Linux, and Cygwin) and 5.7 (Linux), but not under 5.005_03. Your response to davorg was correct. I believe you are hitting a bug in the older Perl's regex engine.
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print

      Blah. That's what I was afraid of. :-(

      I don't have access to a different version of perl and this is something I rather need to match. I don't suppose you know of a way around this bug besides switching perl versions?

      Thanks for confirming my suspicions, though it's not really what I wanted to hear. :-) I was hoping I was just doing something dumb that could be fixed.

      bbfu

        I tried for a while, but I couldn't coax Perl into playing nice with this one :)

        It looks like Perl 5.005_03 simply ignores the \G anchor when a '?' quantifier is present. It also ignores \G if there's a {0,1} quantifier, but oddly enough, not in the cases of '*' or '{0,2}'. Neither of those is quite equivalent (both allow more than one backslash), but perhaps one of them is a suitable substitute.
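To see which quantifiers a given perl gets wrong, a small check along these lines may help. It is only a sketch: `first_token_for` is a hypothetical helper, not something from the thread, and it runs one step of the two-pattern tokenizer with the quantifier you pass in. It sticks to `use strict` so that it should also load on 5.005_03.

```perl
use strict;

# Hypothetical helper (not from the thread): run one step of the
# two-pattern tokenizer with the given quantifier on the backslash
# and return the first token it produces.
sub first_token_for {
    my ($quantifier) = @_;
    my $text = 'foo\$bar';
    # Builds e.g. \G(\\?\$\w+), with the chosen quantifier after the backslash.
    my $escaped = '\G(\\\\' . $quantifier . '\$\w+)';
    if ($text =~ m/$escaped/gc or $text =~ m/\G(\w+)/gc) {
        return $1;
    }
    return;
}

for my $q ('?', '{0,1}', '*', '{0,2}') {
    print "$q => ", first_token_for($q), "\n";
}
```

On a perl without the bug, every line should report foo as the first token; a line reporting \$bar would point at a quantifier for which \G is being ignored.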

        Another option is to expand the '?' quantifier into its two possibilities like so:

        m/\G(\\\$\w+|\$\w+)/gc

        You can also dispense with the \G anchor altogether and just chop the lexed tokens off the front of the string, using s/^TOKEN// for matching, though this method is somewhat less efficient.
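The \G-free approach could look roughly like the sketch below. `tokenize_by_consuming` is a hypothetical name, and the sketch assumes the ^ anchor is not affected by the same bug on 5.005_03, which nobody in this thread has tested.

```perl
use strict;

# Hypothetical helper (not from the thread): tokenize by chopping each
# matched token off the front of a working copy instead of using \G.
sub tokenize_by_consuming {
    my ($input) = @_;
    my @tokens;
    while (length $input) {
        if ($input =~ s/^(\\?\$\w+)//) {
            push @tokens, $1;    # optionally escaped $identifier
        }
        elsif ($input =~ s/^(\w+)//) {
            push @tokens, $1;    # plain word
        }
        else {
            last;                # nothing matches at the front: stop
        }
    }
    return @tokens;
}

print "token: '$_'\n" for tokenize_by_consuming('foo\$bar');
```

Each successful substitution rewrites the string, which is where the extra cost comes from compared with advancing pos() under \G.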
           MeowChow                                   
(bbfu)(solution...sorta)Re: RE prollem: \G, \\? and disappearing data
by bbfu (Curate) on Feb 12, 2001 at 15:05 UTC

    Okay, well thanks to MeowChow's comment that confirmed for me that this is a bug in perl (blech), I went looking for, and found, a work-around.

    The following code works the way it should:

    $_ = 'foo\$bar';
    { if(m/\G((?:\\\$|\$)\w+)/gc or m/\G(\w+)/gc) { print "token: '$1'\n"; redo; } }

    And produces the following output:

    token: 'foo'
    token: '\$bar'

    And it also works correctly on $_ = 'foo$bar'; giving:

    token: 'foo'
    token: '$bar'

    Thanks, MeowChow, for your help!

    I'm still interested if someone can tell me exactly how this bug is working (or not working, depending on your POV) and why it is. Thanks, everyone!

    bbfu

Re: RE prollem: \G, \\? and disappearing data
by davorg (Chancellor) on Feb 12, 2001 at 14:01 UTC

    You're getting burnt by the short-circuiting of "or". Your code works like this:

    1. The first regex is matched against the string. At this point it matches against \$bar.
    2. Because the first expression matches (i.e. returns true), the entire "or" is evaluated as true and there is no need to evaluate the second expression, so Perl doesn't.

    I think you'd be better off using alternation in your regular expressions rather than using two regular expressions.
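A minimal sketch of the single-regex version, assuming the three token shapes from the question are the only ones needed (`tokenize_with_alternation` is a hypothetical name):

```perl
use strict;

# Hypothetical helper: one \G-anchored pattern with alternation in
# place of two separate patterns joined by "or".
sub tokenize_with_alternation {
    my ($input) = @_;
    my @tokens;
    # Alternatives are tried left to right at the current position:
    # escaped $identifier, bare $identifier, then plain word.
    while ($input =~ m/\G(\\\$\w+|\$\w+|\w+)/gc) {
        push @tokens, $1;
    }
    return @tokens;
}

print "token: '$_'\n" for tokenize_with_alternation('foo\$bar');
```

Because every alternative sits behind the same \G, each token has to start exactly where the previous one ended, and the pattern contains no '?' quantifier.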

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      Hrm. Actually, that's what the \G is for. It anchors to the place that the last /g match matched (sorta like the ^ anchor) and defaults to the start of the string (if no previous matches were made). So, theoretically, the first RE shouldn't match at all until after the second RE has matched at least once. And that's my problem: it isn't working that way.
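The intended \G behaviour can be traced step by step with pos(). This is a sketch of what a correctly working perl should do, not of the buggy 5.005_03 behaviour; `trace_g_anchor` is a hypothetical helper.

```perl
use strict;

# Hypothetical helper: record what each \G-anchored match does in turn.
sub trace_g_anchor {
    my ($text) = @_;
    my @trace;

    # No /g match has happened yet, so \G anchors at offset 0,
    # and there is no '$' at offset 0.
    push @trace, ($text =~ m/\G(\$\w+)/gc
        ? "dollar: $1" : "dollar: no match");

    # Still anchored at offset 0: /c leaves pos() alone after a failure.
    push @trace, ($text =~ m/\G(\w+)/gc
        ? "word: $1 (pos " . pos($text) . ")" : "word: no match");

    # \G now anchors where the word match ended, right at the backslash.
    push @trace, ($text =~ m/\G(\\\$\w+)/gc
        ? "escaped: $1 (pos " . pos($text) . ")" : "escaped: no match");

    return @trace;
}

print "$_\n" for trace_g_anchor('foo\$bar');
```

With \G honoured, the dollar pattern cannot match until the word pattern has moved pos() past foo.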

      Also, there are several other RE's in the actual tokenizer so alternation isn't really an option.

      I built my example more or less out of the snippet provided here under the section about the /g modifier, and the other parts of the "lex-like scanner" (as it's called in the perlop page) work just fine. The problem seems to be totally with the \\? (specifically the ? part).

      I do thank you for the suggestion though. :-)

      (PS: Just so it's clear: the idea is to match an identifier preceded by an ampersand and, optionally, a single backslash. The characters in front of the ampersanded identifier (except for the optional backslash) should be matched by the second RE, effectively breaking the string into "tokens". The actual routine is a bit more complicated and has a lot of other, unrelated code that I took out for (sanity|readability)'s sake.)

      (PPS: Okay, so I used a dollar-sign in the code snippet. It's an ampersand in the actual routine, I promise. :-) And it doesn't matter anyway.)

      bbfu