Can you explain why my original regex doesn't work? Maybe I don't understand the finer points of greediness. I would think that the regex would try to match the X first and once successful, would try to match the \S+ and succeed at that.

Re: Re: Re: Capturing everything after an optional character in a regex?
by Anonymous Monk on Dec 04, 2003 at 07:06 UTC
    Can you explain why my original regex doesn't work? Maybe I don't understand the finer points of greediness. I would think that the regex would try to match the X first and once successful, would try to match the \S+ and succeed at that.

    The pattern you gave is /X?(\S+)/. That says: match zero-or-one 'X' character, followed by (and capture) one-or-more non-space characters. Now with the string "abcX123", the regex starts at the beginning of the string and asks itself, "Can I match zero-or-one 'X' characters here?" The answer is, "Yes, I can successfully match zero 'X' characters right here," which it does. It then goes ahead and tries to match one-or-more non-space characters, which also succeeds, so \S+ captures the whole string, X included. Does that help you get the idea?
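
To see this for yourself, here is a minimal sketch that runs the pattern against the same string and reports where the match started. The variable names are mine, purely for illustration.

```perl
use strict;
use warnings;

my $str = "abcX123";

# X? is satisfied by matching zero 'X' characters at offset 0,
# so \S+ then takes every non-space character, including the X.
my ($captured) = $str =~ /X?(\S+)/;

# $-[0] holds the offset where the last successful match began.
my $start = $-[0];

printf "captured '%s' starting at offset %d\n", $captured, $start;
# prints: captured 'abcX123' starting at offset 0
```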

Re: Re: Re: Capturing everything after an optional character in a regex?
by davido (Cardinal) on Dec 04, 2003 at 07:02 UTC
    Your original regex was m/X?(\S+)/

    The problem is that the + quantifier is greedier than ?, and will thus try to match as many characters as possible. Since the X is optional, due to the ? quantifier, X? is yielding to the \S+ portion of your pattern, so that \S+ matches everything even if there is an X that could have matched X?.

    You may be able to get around that problem as simply as by specifying non-greedy matching for the \S+ portion of the regex. In fact, that might be a better solution than the others I've suggested later in this thread. However, I tend to like to spell things out more clearly than simply making something non-greedy and hoping for the best. My later suggestions force \S+ to give up something, whereas specifying non-greediness just weights the tug-of-war.

    Nevertheless, specifying non-greed might just be the simplest approach to your problem, so here it is (untested):

    m/X?(\S+?)$/

    Updated: As another Anonymous Monk pointed out, forcing non-greed in the \S+ portion of the regex doesn't help, so the answers I've posted lower in this thread are preferable to the non-greedy suggestion I've struck out in this node. So is Roger's answer, which allows either case to be captured by the same set of parens, removing the need to count capturing parens. Anon is right, though: because X? is optional, \S+ (and \S+?) robs the X from X?.


    Dave

      The greediness of \S+ has nothing to do with the observed behavior, and making it non-greedy doesn't help the situation. The "problem" is strictly the optional nature of the X?.
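
A quick way to check this is to run the greedy and non-greedy versions side by side. This is only a sketch: "X123" is a second sample string I've added for contrast, and the %capture hash is just a convenient place to keep the results.

```perl
use strict;
use warnings;

my %capture;

# "abcX123" has the X in the middle; "X123" (an added sample) has it first.
for my $str ("abcX123", "X123") {
    ($capture{$str}{greedy}) = $str =~ /X?(\S+)$/;    # greedy \S+
    ($capture{$str}{lazy})   = $str =~ /X?(\S+?)$/;   # non-greedy \S+?
    printf "%-8s greedy='%s' lazy='%s'\n",
        $str, $capture{$str}{greedy}, $capture{$str}{lazy};
}
# prints:
# abcX123  greedy='abcX123' lazy='abcX123'
# X123     greedy='123' lazy='123'
```

Both versions capture the same thing in each case. The X is only consumed by X? when it sits exactly where the match begins.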

      This is slightly OT, but I have to ask: why does greediness get the blame for so much? I am not an expert in RE engines, but I am pretty sure that "leftmostness" trumps greediness nearly every time. Correct? That is, the "leftmost" match always succeeds before the "best" match or the "biggest" match.

      \S+'s greediness doesn't figure into this problem in the least, as far as I can tell. Greediness is right-acting, not omni-directional.
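
As a rough illustration of leftmost-beats-longest, here is a sketch using an alternation I made up for the purpose (it is not from the original question). The alternative that could match the longer piece is listed first, yet the engine still reports the match that starts earliest in the string.

```perl
use strict;
use warnings;

my $str = "abcX123";

# X\d+ could match the four characters "X123", but only from offset 3.
# At offset 0, X\d+ fails and [a-z]+ succeeds, so the search stops there.
my ($match) = $str =~ /(X\d+|[a-z]+)/;
my $offset = $-[0];

printf "matched '%s' at offset %d\n", $match, $offset;
# prints: matched 'abc' at offset 0
```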