... the real pattern I'm matching is more like (untested)

/(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/
...
What should I have done here to have something always match that date and occasionally also have the variable assignment later in the string?

I've put together some example strings (which you did not provide) and some regexes to try to answer your question.

Consider:

    Win8 Strawberry 5.8.9.5 (32) Mon 08/16/2021 20:36:57
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
            /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
          # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
          # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", undef)
    ("2021-08-19 a=b", "2021-08-19", undef)
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)
    /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/

(Update: This is the regex active in the code above.) It fails to capture the optional assignment variable in every case. Why? (Note that the date substring is always extracted properly, here and in all the code below.)

.* is greedy and will consume everything (except, by default, newlines) to the end of the string. Then (?:a=([a-z]+)) tries to match and cannot because the match point is at the end of the string. That's OK because (?:a=([a-z]+))? is optional (update: and so the RE need not backtrack); the overall match can succeed. However, the assignment variable is never captured because .* has already run past it in the string: it's not there to capture.
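
To make the greedy behaviour visible, here is a minimal sketch that wraps the same regex in a helper and adds one extra capture group around the .* part. The helper name greedy_parts and the extra group are my additions for illustration; they are not in the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): the greedy regex
# discussed above, with an extra capture group around .* so that we
# can see how much of the string it consumed.
sub greedy_parts {
    my ($s) = @_;
    if ($s =~ /(\d\d\d\d-\d\d-\d\d) (.*)(?:a=([a-z]+))?/) {
        return ($1, $2, $3);    # $3 is the assignment variable, if captured
    }
    return;                     # empty list when there is no match
}

for my $s ('2021-08-17 foo a=bcd', '2021-08-18 a=bcd') {
    my ($date, $consumed, $var) = greedy_parts($s);
    printf "%-22s  .* consumed [%s]  var=%s\n",
        $s, $consumed, defined $var ? $var : 'undef';
}
```

For both strings the .* group should hold everything after the date ('foo a=bcd' and 'a=bcd' respectively), and the variable should come back undefined, because nothing is left for the optional group to match.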

Next:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
          # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
            /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
          # /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", undef)
    ("2021-08-17 foo a=bcd", "2021-08-17", undef)
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)
    /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/

(Update: This is the regex active in the code above.) Making .*? lazy helps a bit, but some of the variables that are present are still not captured. (Again, the date substrings are always captured.)

Failure to capture happens when something like 'foo' is present before the assignment substring. (I assume junk like 'foo' may be present because what's the point of the .* otherwise?) A lazy .*? first tries to match nothing at all. If the assignment substring follows immediately at that point, (?:a=([a-z]+))? matches it and the variable is captured. If anything else (e.g., 'foo') sits at that point, (?:a=([a-z]+)) cannot match there, but because the group is still completely optional it matches the empty string instead; the overall match succeeds without .*? ever having to expand, and the assignment variable is not captured.
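
A small sketch along the same lines shows how little the lazy .*? actually consumes. As before, the helper name lazy_parts and the extra capture group around .*? are hypothetical additions of mine, used only to expose what that part of the regex matched.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): the lazy regex
# discussed above, with an extra capture group around .*? so that we
# can see how much of the string it consumed.
sub lazy_parts {
    my ($s) = @_;
    if ($s =~ /(\d\d\d\d-\d\d-\d\d) (.*?)(?:a=([a-z]+))?/) {
        return ($1, $2, $3);    # $3 is the assignment variable, if captured
    }
    return;                     # empty list when there is no match
}

for my $s ('2021-08-17 foo a=bcd', '2021-08-18 a=bcd') {
    my ($date, $consumed, $var) = lazy_parts($s);
    printf "%-22s  .*? consumed [%s]  var=%s\n",
        $s, $consumed, defined $var ? $var : 'undef';
}
```

In both cases .*? should consume the empty string. The variable is captured only for the second string, where a= sits directly after the space that follows the date.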

What about:

    >perl -Mstrict -Mwarnings
    use Data::Dump qw(dd);

    for my $s (
        '2021-08-16 foo a=bcd bar',
        '2021-08-17 foo a=bcd',
        '2021-08-18 a=bcd',
        '2021-08-19 a=b',
        '2021-08-20 a=',
        '2021-08-21 ',
        '2021-08-22',
        'xyzzy',
        ) {
        my $matched = $s =~
          # /(\d\d\d\d-\d\d-\d\d) .*(?:a=([a-z]+))?/      # .* greedy - fails all
          # /(\d\d\d\d-\d\d-\d\d) .*?(?:a=([a-z]+))?/     # .*? lazy - fails some
            /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/      # works
          # /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/     # works
            ;
        dd $s, $1, $2 if $matched;
      # dd $s, $1, $2;  # ???
        }
    ^Z
    ("2021-08-16 foo a=bcd bar", "2021-08-16", "bcd")
    ("2021-08-17 foo a=bcd", "2021-08-17", "bcd")
    ("2021-08-18 a=bcd", "2021-08-18", "bcd")
    ("2021-08-19 a=b", "2021-08-19", "b")
    ("2021-08-20 a=", "2021-08-20", undef)
    ("2021-08-21 ", "2021-08-21", undef)
    /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/

This is a lot better. It captures the assignment variable in every case in which it is fully present, even when it's preceded by junk.

The whole (?:.*a=([a-z]+))? expression is optional, but within the expression, the a=([a-z]+) must match (even if preceded by junk), and if it matches, the variable will be captured.
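
As a sketch of how this regex might be used in practice, the match can be wrapped in a small sub that always returns the date and returns the variable only when it was captured. The name parse_line and the return convention are assumptions of mine, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name and return convention are illustrative):
# returns (date, variable) on a match, where variable may be undef,
# and an empty list when the string has no "date followed by space".
sub parse_line {
    my ($s) = @_;
    if ($s =~ /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/) {
        return ($1, $2);
    }
    return;
}

for my $s ('2021-08-16 foo a=bcd bar', '2021-08-20 a=', 'xyzzy') {
    if (my ($date, $var) = parse_line($s)) {
        printf "'%s': date=%s var=%s\n",
            $s, $date, defined $var ? $var : 'undef';
    }
    else {
        print "'$s': no match\n";
    }
}
```

The caller has to check the variable for definedness, because a line with a date but no complete assignment still counts as a match.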

    /(\d\d\d\d-\d\d-\d\d) (?:.*?a=([a-z]+))?/

What happens if the .* is changed to .*?, i.e., made lazy? Try it for yourself. Is there any difference in output? Can you explain what's going on?

This is a bit off-topic, but what's with the commented-out
  # dd $s, $1, $2;  # ???
statement at the end of the code? If you un-comment this statement and comment out the
    dd $s, $1, $2 if $matched;
statement that's been used so far, how does the displayed output differ? Do we start to see "dates" extracted from strings from which they should not be extracted, like '2021-08-22' (no required space following the date substring) and 'xyzzy' (no date substring whatsoever)? Why does "dates" have scare-quotes? What's going on here?
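
One defensive idiom that relates to these questions, offered as a sketch rather than as something the code above uses: assign the captures directly from the match in list context, so that the values you read always come from this particular match attempt. The variable names $date and $var are mine.

```perl
use strict;
use warnings;

for my $s ('2021-08-21 ', '2021-08-22', 'xyzzy') {
    # In list context a successful match returns its capture groups
    # (undef for a group that did not participate); a failed match
    # returns the empty list, so both variables end up undef.
    my ($date, $var) = $s =~ /(\d\d\d\d-\d\d-\d\d) (?:.*a=([a-z]+))?/;

    printf "'%s': date=%s var=%s\n",
        $s,
        defined $date ? $date : 'undef',
        defined $var  ? $var  : 'undef';
}
```

With this form there is no need to consult $1 or $2 at all, so the question of what they hold after a failed match does not arise.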

And yes, regexes be tricky.


Give a man a fish:  <%-{-{-{-<


In reply to Re: Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why. by AnomalousMonk
in thread Making a subpattern optional with ? causes subpattern match to fail. I am confused as to why. by msouth
