daemon23 has asked for the wisdom of the Perl Monks concerning the following question:

Alright, I've got the regexp /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*?)$/
When I attempt to evaluate the string "blah, blah2" with it, and have $OPor = ',', $1 contains "blah", and $2 contains ",", but $3 is empty, and I would expect it to contain "blah2". Any ideas on what I'm missing here?

Replies are listed 'Best First'.
Re: Regexp oddity
by chromatic (Archbishop) on Jun 21, 2000 at 07:24 UTC
    If you don't have the end-of-line anchor ($) in your regex in the program, $3 will contain nothing. Otherwise, it will contain " blah2".

    If you make the last parenthesized match greedy by removing the trailing question mark, you can leave out the final $.

    I recommend using a line like the following to show what you've captured, just in case you have whitespace: print "1: ->$1<-\n2: ->$2<-\n3: ->$3<-\n";
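    A minimal sketch of the three cases described above, using that kind of marker output. Only $OPor = ',' is given in the question; the values of $OPnot and $OPand here are placeholders assumed for illustration.

```perl
use strict;
use warnings;

# Only $OPor = ',' comes from the question; '-' and '\+' are assumed placeholders.
my ($OPnot, $OPor, $OPand) = ('-', ',', '\+');
my $input = 'blah, blah2';

# With the trailing $, the non-greedy (.*?) has to stretch to the end of the string.
my @anchored = $input =~ /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*?)$/;

# Without the $, the non-greedy (.*?) is satisfied by the empty string.
my @unanchored = $input =~ /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*?)/;

# Without the $ but with a greedy (.*), the last group runs to the end again.
my @greedy = $input =~ /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*)/;

print "anchored:   3: ->$anchored[2]<-\n";
print "unanchored: 3: ->$unanchored[2]<-\n";
print "greedy:     3: ->$greedy[2]<-\n";
```

    The arrows make the leading space in the captured " blah2" visible, which is easy to miss otherwise.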

Re: Regexp oddity
by Adam (Vicar) on Jun 21, 2000 at 03:37 UTC
    It worked for me:
    C:\>perl -We "$_='blah, blah2';$OPnot='-';$OPor=',';$OPand='\+'; /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*?)$/; print qq[1='$1', 2='$2', 3='$3']"
    1='blah', 2=',', 3=' blah2'
    (WinNT, ActiveState 5.6)

    BTW: If you want \s*? to match anything, you should remove the question mark. *? will happily match the gap between chars.
    (* matches zero or more, but ? tells it to match as little as possible, aka zero.)

      Adam, that's not quite accurate about the '?'. If a question mark follows a quantifier (*?, +?, {min,max}? or ??) in a regex, it makes it "non-greedy". Consider the following code.
      # 3 spaces, a tab, 3 more spaces, another tab and 3 more spaces
      # (represented by chr() for clarity)
      $test = chr(32)x3 . chr(9) . chr(32)x3 . chr(9) . chr(32)x3;
      ($first = $1, $second = $2) if $test =~ /(\s*)\t(\s*)/;
      In this case, the first (\s*) will be greedy and attempt to match as many characters as possible. $first will contain 3 spaces, a tab, and 3 more spaces. $second will contain 3 spaces. However, by adding the question mark, we make it non-greedy.
      ($first = $1, $second = $2) if $test =~ /(\s*?)\t(\s*)/;
      This means that (\s*?) attempts the smallest match possible that still lets the rest of the regex succeed. In this case, $first contains 3 spaces and $second contains 3 spaces, a tab, and 3 more spaces. The '?' does not mean "aka zero".
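      The comparison above is hard to check by eye because the captures are all whitespace. Here is a small sketch that runs both patterns and prints each capture with S for a space and T for a tab; the show() helper and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

# 3 spaces, a tab, 3 more spaces, another tab and 3 more spaces
my $test = (' ' x 3) . "\t" . (' ' x 3) . "\t" . (' ' x 3);

# Hypothetical helper: make whitespace visible (space -> S, tab -> T).
sub show {
    my ($s) = @_;
    (my $visible = $s) =~ tr/ \t/ST/;
    return $visible;
}

# Greedy first group versus non-greedy first group.
my ($g_first, $g_second) = $test =~ /(\s*)\t(\s*)/;
my ($l_first, $l_second) = $test =~ /(\s*?)\t(\s*)/;

printf "greedy:     first=%s second=%s\n", show($g_first), show($g_second);
printf "non-greedy: first=%s second=%s\n", show($l_first), show($l_second);
```

      The greedy line should read first=SSSTSSS second=SSS, and the non-greedy line the reverse.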

      Incidentally, most regexes ending in (.*?)$/ (like the one in the original post) have a superfluous ? because there is no way to make that statement non-greedy, since it's forced to match to the end.
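      A quick way to convince yourself of this, sketched with a made-up input line: both forms of the final group capture the same text once the $ anchor follows them.

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to illustrate the point.
my $line = 'key = some value';

# Non-greedy and greedy final groups, both anchored at the end.
my ($lazy)   = $line =~ /=\s*(.*?)$/;
my ($greedy) = $line =~ /=\s*(.*)$/;

print "lazy:   ->$lazy<-\n";
print "greedy: ->$greedy<-\n";
print "identical\n" if $lazy eq $greedy;
```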

        You are correct; perhaps I should have been clearer. The regex that we were discussing ends with \s*?(.*?)$/, which is somewhat different from your example. Here it is matching the fewest spaces followed by the fewest 'anything but newlines' to the end of the string. Since the . will match white space, the \s*? will match nothing. Always. But thank you for your clarification of the more generic case.
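        To make that visible, here is a sketch that wraps the whitespace part in its own capture group (the extra group is added only for this illustration and is not in the original regex) and compares the non-greedy and greedy forms.

```perl
use strict;
use warnings;

my $input = 'blah, blah2';

# \s*? followed by (.*?)$ : the lazy whitespace match gives everything to the dot.
my ($lazy_ws, $lazy_rest) = $input =~ /,(\s*?)(.*?)$/;

# \s* followed by (.*?)$ : the greedy whitespace match takes the space first.
my ($greedy_ws, $greedy_rest) = $input =~ /,(\s*)(.*?)$/;

print "lazy:   ws->$lazy_ws<- rest->$lazy_rest<-\n";
print "greedy: ws->$greedy_ws<- rest->$greedy_rest<-\n";
```

        With \s*? the space ends up at the front of the last capture; with \s* it is consumed before the capture starts.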
      \s*? is set that way on purpose in case the words have no whitespace between them.
        The question mark in \s*? is not necessary if you are doing that "in case the words have no whitespace between them." The * quantifier matches zero or more of whatever it is quantifying.
        $test = "az"; print "Good\n" if $test =~ /a\s*z/;
        The above regex sees an 'a', followed by zero spaces, followed by a 'z'. Since this matches the value of $test, it prints "Good\n".

        Cheers!

Re: Regexp oddity
by daemon23 (Initiate) on Jun 21, 2000 at 20:49 UTC
    My thanks to everyone who wrote back on this--I finally figured it out.

    $OPand was set to '\s+', as this is how the terms are separated in the input string. This is also why I was using \s*?, believing it would capture any whitespace in the case of the $OPor or $OPnot separators. However, perl was evaluating "blah, blah2" and returning $1 = 'blah,', $2 = ' ', and $3 = 'blah2'. The reason I erroneously assumed the regexp was destroying 'blah2' is that the script loops, evaluating $1 as the next test string. The handler for ($2 =~ /^\s+$/) was written incorrectly, so the script just skipped over the first iteration. It worked on the second iteration, however, evaluating "blah,", and thus finding $1 = "blah", $2 = ",", and $3 = "".
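    For anyone following along, a rough reconstruction of those two iterations. $OPand = '\s+' and $OPor = ',' come from this thread; $OPnot is never stated, so '-' is an assumed placeholder, and the loop is only a guess at the script's shape, not the actual code.

```perl
use strict;
use warnings;

# '-' for $OPnot is an assumption; the other two values are from the thread.
my ($OPnot, $OPor, $OPand) = ('-', ',', '\s+');

my @passes;
my $str = 'blah, blah2';
for my $pass (1 .. 2) {
    my ($left, $op, $right) = $str =~ /^(.*)\s*?($OPnot|$OPor|$OPand)\s*?(.*?)$/
        or last;
    push @passes, [$left, $op, $right];
    print "pass $pass: 1: ->$left<- 2: ->$op<- 3: ->$right<-\n";
    $str = $left;    # the next iteration re-evaluates what was in $1
}
```

    Because (.*) is greedy, the regex splits on the last separator in the string, which is the space and not the comma, so the comma only shows up as $2 on the second pass.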

    Again, thanks for the pointers--they definitely helped me figure out what I'd done incorrectly.

Re: Regexp oddity
by btrott (Parson) on Jun 21, 2000 at 03:41 UTC
    What are $OPnot and $OPand equal to when you use the regexp? That will make a difference.