in reply to Non-greedy substitution

«,» matches a comma, then «.+?» matches as little as possible, then «$» matches at the end of the string or before a LF at the end of the string.

01234567   position
A, B, C

Full:

  1. Start matching at position 0.
    1. At position 0, «,» doesn't match. ⇒ Backtrack.
  2. Start matching at position 1.
    1. At position 1, «,» matches 1 character.
      1. At position 2, «.+?» matches 1 character.
        1. At position 3, «$» doesn't match. ⇒ Backtrack.
      2. At position 2, «.+?» matches 2 characters.
        1. At position 4, «$» doesn't match. ⇒ Backtrack.
      3. At position 2, «.+?» matches 3 characters.
        1. At position 5, «$» doesn't match. ⇒ Backtrack.
      4. At position 2, «.+?» matches 4 characters.
        1. At position 6, «$» doesn't match. ⇒ Backtrack.
      5. At position 2, «.+?» matches 5 characters.
        1. At position 7, «$» matches 0 characters. ⇒ Success.

Summary:

  1. Starts matching at position 1.
  2. At position 1, «,» matches 1 character.
  3. At position 2, «.+?» matches 5 characters.
  4. At position 7, «$» matches 0 characters.

If «.+?» were to match any less, the «$» wouldn't match.
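The trace can be checked directly. What follows is only a minimal sketch: the test string is the "A, B, C" from the position diagram, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $test = join ', ', ('A', 'B', 'C');   # "A, B, C"

my ($start, $matched, $captured);
if ($test =~ /,(.+?)$/) {
    $start    = $-[0];                                # offset where the whole match begins
    $matched  = substr($test, $-[0], $+[0] - $-[0]);  # text of the whole match
    $captured = $1;                                   # text matched by «.+?»
    print "match starts at $start: '$matched' (capture: '$captured')\n";
}
# prints: match starts at 1: ', B, C' (capture: ' B, C')
```

The match begins at the first comma (position 1), and «.+?» ends up covering all 5 remaining characters, as in the summary.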

Solution:

sub join_list {
   return "none" if !@_;  # ???
   my $last = pop;
   return $last if !@_;
   return join( ", ", @_ ) . " and " . $last;
}
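A short usage sketch for the sub above, showing what it returns for zero, one, two, and three items. The sub is repeated so the snippet runs on its own; the sample lists are made up for illustration.

```perl
use strict;
use warnings;

sub join_list {
    return "none" if !@_;          # empty list
    my $last = pop;                # remove the final item from @_
    return $last if !@_;           # a single item needs no separator
    return join( ", ", @_ ) . " and " . $last;
}

print join_list(),              "\n";   # none
print join_list('A'),           "\n";   # A
print join_list('A', 'B'),      "\n";   # A and B
print join_list('A', 'B', 'C'), "\n";   # A, B and C
```

Since the list is never flattened into a string first, no regex is needed to find the last separator.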

Re^2: Non-greedy substitution
by Bod (Parson) on Nov 15, 2024 at 19:38 UTC
    Solution:
    sub join_list {
       return "none" if !@_;  # ???
       my $last = pop;
       return $last if !@_;
       return join( ", ", @_ ) . " and " . $last;
    }

    An interesting solution.

    However, in my quest to understand what is going on, I tried forcing the match to be non-comma characters and came up with this, which produces the desired behaviour.

    perl -e "my $test = join ', ', ('A', 'B', 'C');$test =~ s/,([^,]+?)$/ and$1/; print $test;"

    I still don't understand why the original doesn't work. Surely ,.+?$ is the shortest possible match within the string that starts with a comma and ends at the end of the line...
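For comparison, here is a minimal sketch running both substitutions side by side. It assumes the original substitution used the same ` and$1` replacement as the one-liner above; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $list = join ', ', ('A', 'B', 'C');   # "A, B, C"

# «.+?» may cross commas, so the match begins at the first comma.
(my $original = $list) =~ s/,(.+?)$/ and$1/;

# «[^,]+?» cannot cross a comma, so only the last comma can lead to «$».
(my $no_commas = $list) =~ s/,([^,]+?)$/ and$1/;

print "$original\n";    # A and B, C
print "$no_commas\n";   # A, B and C
```

With «[^,]+?», the attempt starting at the first comma fails outright, so the engine moves on and tries again from the next comma.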

      Your mental model of what «.+?» does is severely flawed. For starters, it doesn't permit patterns to have multiple subpatterns that can match substrings of different lengths.

      «.+?» does not mean "the shortest possible match within the string that starts with a comma".

      «.+» means "one or more non-LF characters, trying in order of decreasing length", and
      «.+?» means "one or more non-LF characters, trying in order of increasing length".

      Note the lack of any mention of commas. «.+?» doesn't do any checks related to commas. The comma is matched independently.
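The "order of trying" distinction is easier to see when nothing follows the quantifier. A minimal sketch, with the test string and variable names assumed for illustration:

```perl
use strict;
use warnings;

my $test = 'A, B, C';

# Nothing after the quantifier: the first length tried is accepted.
my ($greedy) = $test =~ /,(.+)/;     # longest first:  " B, C"
my ($lazy)   = $test =~ /,(.+?)/;    # shortest first: " "

# Followed by «$»: both keep trying lengths until «$» can match.
my ($greedy_anchored) = $test =~ /,(.+)$/;    # " B, C"
my ($lazy_anchored)   = $test =~ /,(.+?)$/;   # " B, C"

print "greedy='$greedy' lazy='$lazy'\n";
print "greedy_anchored='$greedy_anchored' lazy_anchored='$lazy_anchored'\n";
```

In the anchored case both quantifiers start at the same comma and give the same result; only the order in which the lengths are tried differs.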

        Your mental model of what «.+?» does is severely flawed

        Well, yes...hence the need to ask the question...

        «.+?» doesn't do any checks related to commas

        But I included the comma in my example...perhaps I could have formatted it so it was more prominent.

Re^2: Non-greedy substitution
by Bod (Parson) on Nov 15, 2024 at 19:27 UTC
    If .+? were to match any less, $ wouldn't match.

    I'm sorry, but I don't understand why .+? doesn't match 2 characters at position 5 - the match has to be tied to the end of the string...doesn't it?

      There can't be gaps in what matches. «.+?» must start matching where «,» left off. I added a "full" trace to my post.

      The regex engine prioritizes "leftmost". So it will always find the leftmost place where the entire regex will match.

        That's wrong.

        That's the same mindset as saying .*? prioritizes shortest. But we all know that mindset is flawed, since that's the issue at hand.

        For example, your explanation doesn't work for \G(?s:.)*?\K,(.+?)$, which is the OP's pattern with the implicit bits made explicit.
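A minimal sketch checking that the explicit form behaves like the short form on the thread's sample string. The ` and$1` replacement and the variable names are assumptions for illustration, and «\K» needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $test = 'A, B, C';

# Short form: the engine implicitly tries each starting position in turn.
(my $implicit = $test) =~ s/,(.+?)$/ and$1/;

# Explicit form: anchor at position 0 with \G, lazily skip characters,
# then use \K so the skipped prefix is kept out of the replaced text.
(my $explicit = $test) =~ s/\G(?s:.)*?\K,(.+?)$/ and$1/;

print "$implicit\n";   # A and B, C
print "$explicit\n";   # A and B, C
```

In the explicit form, the scan over starting positions is itself a lazy quantifier, «(?s:.)*?», that is extended one character at a time.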