msouth has asked for the wisdom of the Perl Monks concerning the following question:

Making a parenthesized pattern optional with ? is causing the match to fail and I don't understand why.

What I thought: adding a ? after the parens would make the match optional, but if the match was there, I would definitely see it.

What seems to me to be happening: adding a ? after the parens says "if there's any way to not see this match, skip it"...or something. I don't understand.

In the debug session below, first I set up the string to contain nothing but the thing I'm looking for, 'a=b'. If that's all that is in the string, it matches with or without the enclosing parens and question mark.

Then I add an arbitrary extra character ("x") before what I'm trying to match, and the ? seems to say "optional? you bet, I'll definitely take the option of not matching that and just not see your optional thing".

  DB<19> $foo = 'a=b';
  DB<20> x $foo =~/a=([a-z]+)/;
0  'b'
  DB<21> x $foo =~/(?:a=([a-z]+))?/;
0  'b'
  DB<22> $foo = 'x' . $foo
  DB<23> x $foo =~/(?:a=([a-z]+))?/;
0  undef

I've simplified this to smallest form, but let me give context. I have a very normal situation of something that should always match at the beginning of the line and then there will also sometimes be this "a=b" type construct later. So I wanted to optionally also match the a=b construct. So the real pattern I'm matching is more like (untested)

/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/

This really seemed to me like a straightforward use of optional match of a sub expression, and I'm completely baffled and feel stupid.

What should I have done here to have something always match that date and occasionally also have the variable assignment later in the string?


Replies are listed 'Best First'.
Re: Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why.
by haj (Vicar) on Aug 16, 2021 at 16:38 UTC

    Regarding your example from the debugger: Your question mark applies to the whole construct. After prepending an 'x', the non-capturing parens no longer match at position 0, but since they are optional, they match the empty string - and within that empty string, the first capture is undefined. You can check that by adding the /g modifier to your regex and printing pos afterwards. Your regex should read:

    x $foo =~/(?:a=([a-z]+)?)/;

    The same would apply to your real pattern, too.
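    The difference between the two placements of the question mark can be made visible by printing where each match starts. This is only a minimal sketch: match_info is a hypothetical helper written for this illustration, and the test string 'xa=b' is the one from the debugger session.

```perl
use strict;
use warnings;

# Hypothetical helper: run a pattern and report where the overall match
# started, what text it covered, and what capture group 1 holds.
sub match_info {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return {
        start => $-[0],                               # offset of match start
        text  => substr($str, $-[0], $+[0] - $-[0]),  # text of whole match
        cap   => $1,                                  # undef if group unused
    };
}

# Question mark after the group: the whole group is optional.
my $optional_group = match_info('xa=b', qr/(?:a=([a-z]+))?/);

# Question mark inside the group: 'a=' is required, only the value is optional.
my $optional_value = match_info('xa=b', qr/(?:a=([a-z]+)?)/);

for my $r ($optional_group, $optional_value) {
    printf "start=%d text='%s' capture=%s\n",
        $r->{start}, $r->{text},
        defined $r->{cap} ? "'$r->{cap}'" : 'undef';
}
```

    The first pattern is satisfied by an empty match at offset 0, so the engine never moves on to offset 1 where 'a=b' begins. The second pattern cannot match at offset 0, so the engine advances and finds the assignment.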

      > they match the empty string - and within that empty string, the first capture is undefined.

      To elaborate further, here is the proof that it is matching, just at the wrong position:

        DB<1> $foo = 'a=b';
        DB<2> $foo = 'x' . $foo
        DB<3> p scalar $foo =~/(?:a=([a-z]+))?/;
      1
        DB<4> p $1

        DB<5>
        DB<6> p scalar $foo =~/(?:a=([a-z]+))?/g;
      1
        DB<7> p pos $foo
      0
        DB<8>

      Cheers Rolf
      (addicted to the Perl Programming Language :)

Re: Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why.
by AnomalousMonk (Archbishop) on Aug 17, 2021 at 02:20 UTC
    ... the real pattern I'm matching is more like (untested)

    /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
    ...
    What should I have done here to have something always match that date and occasionally also have the variable assignment later in the string?

    I've put together some example strings (which you did not provide) and some regexes to try to answer your question.

    Consider:

    Win8 Strawberry 5.8.9.5 (32) Mon 08/16/2021 20:36:57
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
            /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
          # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
          # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", undef)
    ("2021-08-19 a=b", "2021-08-19", undef)
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

    /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/
    (Update: This is the regex quoted above, but with a greedy .* in place of the lazy .*?.) This never extracts the optional assignment variable, in any of the cases. Why? (Note that the date substring is always properly extracted, as also in all code below.)

    .* is greedy and will consume everything (except, by default, newlines) to the end of the string. Then (?:a=([a-z]+)) tries to match and cannot because the match point is at the end of the string. That's OK because (?:a=([a-z]+))? is optional (update: and so the RE need not backtrack); the overall match can succeed. However, the assignment variable is never captured because .* has already run past it in the string: it's not there to capture.
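    One way to see this for yourself is to ask where the overall match ended. The following is a minimal sketch, assuming one of the sample strings above; greedy_probe is a hypothetical helper name, not part of the code in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: report the end offset of the overall match and the
# contents of capture group 2 (the assignment value).
sub greedy_probe {
    my ($str) = @_;
    return unless $str =~ /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/;
    return { end => $+[0], value => $2 };
}

my $s = '2021-08-16 foo a=bcd bar';
my $r = greedy_probe($s);

# The greedy .* has swallowed the rest of the line, so the match ends at
# the end of the string and the optional group matched nothing.
printf "match ended at %d of %d, value=%s\n",
    $r->{end}, length $s,
    defined $r->{value} ? $r->{value} : 'undef';
```

    The end offset equals the length of the string, which shows that .* covered the assignment substring itself.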

    Next:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
          # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
            /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
          # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

    /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
    (Update: This is the regex quoted above.) Making .* lazy helps a bit, but some of the variables that are present are still not captured. (Again, the date substrings are always captured.)

    Failure to capture happens when something like 'foo' is present before the assignment substring. (I assume junk like 'foo' may be present because what's the point of the .* otherwise?) A lazy .*? first tries to match nothing at all. If the assignment substring begins right at that point, (?:a=([a-z]+))? matches it and the variable is captured. If anything else (e.g., 'foo') begins there, (?:a=([a-z]+))? matches the empty string instead, because it is still completely optional; the overall match succeeds at once, .*? never has a reason to expand, and the assignment variable is not captured.
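    Printing the text covered by the overall match makes the lazy behaviour visible. This is a rough sketch using two of the sample strings; lazy_probe is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the overall match and the
# contents of capture group 2 for the lazy pattern.
sub lazy_probe {
    my ($str) = @_;
    return unless $str =~ /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/;
    return {
        matched => substr($str, $-[0], $+[0] - $-[0]),
        value   => $2,
    };
}

my $with_junk    = lazy_probe('2021-08-16 foo a=bcd bar');
my $without_junk = lazy_probe('2021-08-18 a=bcd');

for my $r ($with_junk, $without_junk) {
    printf "matched='%s' value=%s\n",
        $r->{matched},
        defined $r->{value} ? $r->{value} : 'undef';
}
```

    With junk before the assignment, the overall match stops right after the space that follows the date. Without junk, the assignment begins exactly where .*? stands with nothing consumed, so the group matches and captures.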

    What about:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
          # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
          # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
            /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", "bcd")
    ("2021-08-17 foo a=bcd", "2021-08-17", "bcd")
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)

    /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/
    This is a lot better. It captures the assignment variable in every case in which it is fully present, even when it's preceded by junk.

    The whole (?:.*a=([a-z]+))? expression is optional, but within the expression, the a=([a-z]+) must match (even if preceded by junk) and if it matches, the variable will be captured.
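    If it helps, the working pattern could be wrapped in a small function so that callers never touch $1 and $2 directly. This is just a sketch under that assumption; parse_line is a hypothetical name, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the working pattern. Returns (date, value)
# on a match, where value is undef if no assignment was found, and an
# empty list if the line does not start a match with "date + space".
sub parse_line {
    my ($line) = @_;
    return unless $line =~ /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/;
    return ($1, $2);
}

for my $line ('2021-08-16 foo a=bcd bar', '2021-08-21 ', '2021-08-22') {
    my @fields = parse_line($line);
    if (@fields) {
        my ($date, $value) = @fields;
        printf "%s => date=%s value=%s\n",
            $line, $date, defined $value ? $value : 'undef';
    }
    else {
        print "$line => no match\n";
    }
}
```

    Returning the captures only after a successful match keeps the caller from ever reading capture variables that belong to some earlier match.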

    /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/   What happens if the .* is changed to .*?, i.e., made lazy? Try it for yourself. Is there any difference in output? Can you explain what's going on?

    This is a bit off-topic, but what's with the commented-out
      # dd $s, $1, $2;  # ???
    statement at the end of the code? If you un-comment this statement and comment out the
        dd $s, $1, $2 if $matched;
    statement that's been used so far, how does the displayed output differ? Do we start to see "dates" extracted from strings from which they should not be extracted, like '2021-08-22' (no required space following the date substring) and 'xyzzy' (no date substring whatsoever)? Why does "dates" have scare-quotes? What's going on here?

    And yes, regexes be tricky.


    Give a man a fish:  <%-{-{-{-<