Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want '123', but I get 'abcX123'. How can I capture everything after an optional character?
perl -e '$string="abcX123"; $string =~ /X?(\S+)/; print $1;'
Note: This is a smaller part of a larger regex and I'm looking for a regex solution. Thanks!

Replies are listed 'Best First'.
Re: Capturing everything after an optional character in a regex?
by sauoq (Abbot) on Dec 04, 2003 at 09:04 UTC

    It usually helps to rephrase what you are looking for. In your case, if I've understood your clarification elsewhere in this thread, you want to capture "all of the non-X characters at the end of the string." In code, that's /([^X]*)$/ and here it is in action:

    $ perl -le '$_ = "abcX123"; print $1 if /([^X]*)$/;'
    123
    $ perl -le '$_ = "abc123"; print $1 if /([^X]*)$/;'
    abc123

    -sauoq
    "My two cents aren't worth a dime.";
    

      Simple, elegant.

      I just realized I'd been sitting on a reply for about an hour. :/

      --
      Allolex

Re: Capturing everything after an optional character in a regex?
by davido (Cardinal) on Dec 04, 2003 at 06:11 UTC
    You can't really do what you're asking in the way you're trying to do it.

    Is it the digits you're trying to capture? Or do the digits at least guarantee the start of what you want to capture? It's impossible to expect to start immediately following a character that may not be there. If it's not there, you'll get just everything instead. If 'X' is optional, it's an unreliable anchor.

    Try something like this, if digits mark the start:

    m/(\d\S*)$/

    I used \S* instead of \S+, so that if the string contains 'abc1' the RE will still capture the '1'. Also, you said "everything after...", so I anchored the match all the way to the end of the string with the $ metachar.


    Dave

      Can you explain why my original regex doesn't work? Maybe I don't understand the finer points of greediness. I would think that the regex would try to match the X first and once successful, would try to match the \S+ and succeed at that.
        Can you explain why my original regex doesn't work? Maybe I don't understand the finer points of greediness. I would think that the regex would try to match the X first and once successful, would try to match the \S+ and succeed at that.

        The pattern you gave is /X?(\S+)/. That says: match zero-or-one 'X' character, followed by (and capture) one-or-more non-space characters. With the string "abcX123", the regex engine starts at the beginning of the string and asks itself, "Can I match zero-or-one 'X' characters here?" The answer is, "Yes, I can successfully match zero 'X' characters right here," which it does. It then goes ahead and tries to match one-or-more non-space characters, which also succeeds, so \S+ captures the whole string. Does that help you get the idea?
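A quick way to see this for yourself is to print where the match started. This is only a sketch: the variable names are made up for illustration, and @- is Perl's built-in array of match start offsets.

```perl
use strict;
use warnings;

my $string = "abcX123";

my ($match_start, $captured);
if ($string =~ /X?(\S+)/) {
    $match_start = $-[0];   # offset where the whole match began
    $captured    = $1;      # what \S+ grabbed
}

# The match starts at offset 0: X? matched zero characters there,
# so \S+ was free to take the entire string, X included.
print "match starts at offset $match_start, captured '$captured'\n";
```

The engine never gets as far as the X on its own terms, because the very first starting position already yields a successful match.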

        Your original regex was m/X?(\S+)/

        The problem is that the + quantifier is greedier than ?, and will thus try to match as many characters as possible. Since the X is optional, due to the ? quantifier, X? is yielding to the \S+ portion of your pattern, so that \S+ matches everything even if there is an X that could have matched X?.

        You may be able to get around that problem as simply as by specifying non-greedy matching for the \S+ portion of the regex. In fact, that might be a better solution than the others I've suggested later in this thread. However, I tend to like to spell things out more clearly than simply making something non-greedy and hoping for the best. My later suggestions force \S+ to give up something, whereas specifying non-greediness just weights the tug-of-war.

        Nevertheless, specifying non-greed might just be the simplest approach to your problem, so here it is (untested):

        m/X?(\S+?)$/

        Updated: As another Anonymous Monk pointed out, forcing non-greed in the \S+ portion of the regex doesn't help, and thus the answers I've posted lower in this thread are preferable to the one I've struck out in this node. So is Roger's answer, which allows either case to be captured by the same set of parens, removing the need to count capturing parens. Anon is right though: X? being optional lets \S+ (and \S+?) rob the X from X?.
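For anyone who wants to check the non-greedy variant, here is a small sketch using the string from the original question. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "abcX123";

# Lazy \S+? with the $ anchor: the match still starts at offset 0 and
# must stretch to the end of the string, so the X is swallowed anyway.
my ($lazy_anchored) = $string =~ /X?(\S+?)$/;

# Lazy \S+? without the anchor: it stops after a single character.
my ($lazy_unanchored) = $string =~ /X?(\S+?)/;

print "anchored:   '$lazy_anchored'\n";
print "unanchored: '$lazy_unanchored'\n";
```

Either way, laziness only changes how far \S+? extends. It does not change where the match begins, and that is the real issue here.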


        Dave

Re: Capturing everything after an optional character in a regex?
by Roger (Parson) on Dec 04, 2003 at 06:12 UTC
    Use a lookbehind -
    $string =~ /((?<=X)\S+)/;
    Note that this only sets $1 if X is present.
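To illustrate the note above, here is a minimal sketch that runs the lookbehind pattern against both sample strings from the thread. The helper name after_x is hypothetical. Capturing in list context avoids reading a stale $1 left over from an earlier successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the capture, or undef when nothing matched.
sub after_x {
    my ($string) = @_;
    my ($captured) = $string =~ /((?<=X)\S+)/;
    return $captured;
}

for my $string ("abcX123", "abc123") {
    my $captured = after_x($string);
    printf "%-8s => %s\n", $string,
        defined $captured ? "'$captured'" : "no match";
}
```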

    Can you please phrase your question more clearly, ie. what's the behaviour you are expecting? What do you want to capture when X is not present?

Re: Capturing everything after an optional character in a regex?
by Anonymous Monk on Dec 04, 2003 at 06:29 UTC
    Clarification: If X is there I want everything after the X. If X is not there I want the whole string. Although I used 123 in the example, they could be any non-whitespace characters. Thanks!
      OK, here's my attempt now that Anonymous Monk's intention is clear.

      $string =~ m/(?:(?=.*?X)X|(?!.*?X))(\S+)/;
      And here's a little test -

      $string1 = "abcX123";
      $string2 = "abc123";
      $string1 =~ m/(?:(?=.*?X)X|(?!.*?X))(\S+)/;
      print "$1\n";
      $string2 =~ m/(?:(?=.*?X)X|(?!.*?X))(\S+)/;
      print "$1\n";
      And the output is as expected, and both in $1 -
      123
      abc123
      And the tricky bit in the above regex is the (?:(?=.*?X)X|(?!.*?X)) part, which defines an optional anchor point.


      Update: I hit my head on the wall a couple of times, literally, after I saw sauoq's much cleverer solution below. I was so locked into the idea of an optional anchor point that I failed to notice the vital clue, capture till the end, which defines a fixed anchor point to look back from instead of a floating anchor point that looks forward. Although my solution worked, it was far more complicated than necessary.

      An important lesson I have learnt today: when a problem seems rather complicated, take a step back and look for other clues. The alternative solution is probably staring me right in the face!

      This tests ok for my contrived test string:

      m/X(\S+)|((?<!X)\S+)/

      Or here's a way without negative lookbehind:

      m/X(\S+)|([^X]\S*)/

      They're not functionally identical, but should both accomplish what you've described.

      Of course there is the issue now of counting capturing parens.
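When trying these two patterns, it is worth including the thread's original string and not only strings that begin with X. The sketch below (the helper name first_capture and the labels are hypothetical) runs both patterns and returns whichever group captured. Because the alternation is unanchored, the second branch can succeed at offset 0 before the engine ever reaches the X.

```perl
use strict;
use warnings;

my %patterns = (
    lookbehind => qr/X(\S+)|((?<!X)\S+)/,
    char_class => qr/X(\S+)|([^X]\S*)/,
);

# Hypothetical helper: returns whichever of the two groups captured.
sub first_capture {
    my ($regex, $string) = @_;
    my ($after_x, $no_x) = $string =~ $regex;
    return $after_x // $no_x;
}

for my $name (sort keys %patterns) {
    for my $string ("X123", "abc123", "abcX123") {
        my $got = first_capture($patterns{$name}, $string);
        # For "abcX123" the second branch already matches at offset 0,
        # so the capture is the whole string rather than "123".
        print "$name: $string => $got\n";
    }
}
```

So both patterns behave as described when X is the first character or absent, but not when other non-space characters precede the X.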


      Dave

        And because it's part of a larger regex I'll have to put parens around the alternation...

        Of course there is the issue now of counting capturing parens.

        That's what the /x modifier is there to help with.

        ----
        I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
        -- Schemer

        : () { :|:& };:

        Note: All code is untested, unless otherwise stated

      It seems further clarification would be helpful. What do you want to do in the case that there is more than one 'X'? (Or is that case not in your requirements?) If such a case won't exist, or if you want to get everything after the last 'X', I stand by my original suggestion. Use /([^X]*)$/.

      If you can have more than one 'X' and you want everything after the first X then something like /^(?:.*?X)?(.*)$/ should do the trick.
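The difference between the two patterns only shows up when there is more than one X. A small sketch, using a made-up test string with two X's:

```perl
use strict;
use warnings;

my $string = "aXbXc";   # made-up test string with two X's

my ($after_last)  = $string =~ /([^X]*)$/;          # everything after the last X
my ($after_first) = $string =~ /^(?:.*?X)?(.*)$/;   # everything after the first X

print "after last X:  '$after_last'\n";
print "after first X: '$after_first'\n";

# With no X at all, both patterns fall back to the whole string.
my ($no_x_last)  = "abc123" =~ /([^X]*)$/;
my ($no_x_first) = "abc123" =~ /^(?:.*?X)?(.*)$/;
print "no X: '$no_x_last' and '$no_x_first'\n";
```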

      -sauoq
      "My two cents aren't worth a dime.";
      
        There should be only one X. Also, I tried to modify your suggestion to /([^X\s]*)/ since, as in my original code, I only want to grab non-whitespace. But then if the string is 'abcX12  3' I only match abc, when I want to match the '12'. Also, as I said before this is part of a larger regex so I can't use begin/end of line characters.
      If X is there I want everything after the X. If X is not there I want the whole string.
      Try this on a copy of the string:
      s/^.*?X//;
      This removes everything up to and including the first X, if one exists. It won't change the string if there isn't one.

      Drop the question mark if you want to locate the last "X".

      And if the string can contain newlines, add the /s modifier, which changes the matching behaviour of /./ to possibly match a newline as well.
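Put together, the substitution approach might look like the sketch below. The helper name strip_through_x is hypothetical. It works on a copy, so the caller's string is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: strips everything up to and including the first X.
sub strip_through_x {
    my ($string) = @_;
    (my $copy = $string) =~ s/^.*?X//s;   # /s lets . match newlines too
    return $copy;
}

print strip_through_x("abcX123"), "\n";   # X present
print strip_through_x("abc123"),  "\n";   # no X: string comes back unchanged
```

This is a substitution rather than a capture, so it only suits the original poster if this step can be pulled out of the larger regex.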

Re: Capturing everything after an optional character in a regex?
by allolex (Curate) on Dec 04, 2003 at 09:45 UTC
    perl -le '$string="abcX123"; $string =~ /X([^\s]+)/; print $1;'
    123

    Update 2003-12-04 12:50:48 CET: See sauoq's node in this thread for a solution which fulfills all of Anonymous Monk's requirements, as mentioned in opqdonut's post below.

    --
    Allolex

      Allolex, your piece o' code doesn't really satisfy the needs of Anonymous Monk; he needed to capture the whole line if there is no X.

      sauoq's code is simply beautiful, maybe someday my code will look like that :)

      apparently yours,
      J

        Well, it meets the original requirements as I understood them. See my previous reply to sauoq's node for part of the reason (besides laziness and stupidity on my part). It's nice to see that I'm not the only one who doesn't read the whole thread before posting ;) For the record, davido /msg'd me concerning this about three seconds after I hit 'create'... Even at past 2 a.m. his time, he was more alert than me.

        --
        Allolex

Re: Capturing everything after an optional character in a regex?
by podian (Scribe) on Dec 04, 2003 at 18:54 UTC
    The following works (assuming X as the optional character):

    perl -e '$string="abcX123"; $string =~ /.*X(\S+)/; print $1;'

    I do not understand why you are using X? in your pattern.

Re: Capturing everything after an optional character in a regex?
by serf (Chaplain) on Dec 05, 2003 at 14:50 UTC
    I'm not sure if it's what you're looking for (i.e. doesn't break something else somewhere else...) but I find that:
    perl -e '$string="abcX123"; $string =~ /X+?(\S+)/; print $1;'
    works for me... HTH
      $string = "abcX123"; ($afterX) = ($string =~ /X(.*)/); print $afterX;