Non-greedy substitution

Bod has asked for the wisdom of the Perl Monks concerning the following question:

Re: Non-greedy substitution by choroba (Cardinal) on Nov 15, 2024 at 20:12 UTC
The frugal quantifier means shortest, but still leftmost. You can use a greedy `.*` at the beginning to consume as much as it can, keep it with `\K`, and then replace the comma with " and". `#!/usr/bin/perl use warnings; use strict; use experimental qw( signatures ); sub non_oxford_list($s) { $s =~ s/^.*\K,/ and/r } use Test::More tests => 3; is non_oxford_list('A'), 'A'; is non_oxford_list('A, B'), 'A and B'; is non_oxford_list('A, B, C'), 'A, B and C';`
Re^2: Non-greedy substitution by Bod (Parson) on Nov 15, 2024 at 21:35 UTC
> The frugal quantifier means shortest, but still leftmost
Thank you... it was the "leftmost" that was missing from my thought process! I think I've understood now 👍
Re^3: Non-greedy substitution by LanX (Saint) on Nov 18, 2024 at 02:18 UTC
> I think* I've understood now 👍
*) Maybe this helps: you can reproduce it in the debugger, started with `perl -de0`.
`DB<22> p $_ = join "," , A..E`
`A,B,C,D,E`
`DB<23> p m/(,.*,)/ # longest possibility from first comma to last comma`
`,B,C,D,`
`DB<24> p m/(,.*?,)/ # shortest possibility from first comma to next comma`
`,B,`
`DB<25> p m/(,.*$)/ # longest possibility from first comma to end of line`
`,B,C,D,E`
`DB<26> p m/(,.*?$)/ # shortest possibility from first comma to end of line`
`,B,C,D,E`
The regex engine tries to find a solution step by step: first it tries the first sub-pattern, here `","` (comma); then it matches `"."` (any character) as many times as the quantifier (`"*"` or `"*?"`) allows, until it can match the next sub-pattern (`","` or `"$"` = EOL). If not all criteria can be met, it will start anew from the next comma, and so on.
The problem with your regex was that it was already matching from the leftmost comma. All the solutions provided by other monks make sure that only the rightmost comma is allowed to match. For instance:
`DB<27> p m/(,[^,]*)$/ # comma followed by non-commas till EOL`
`,E`
The engine will actually first try to match at all the other commas to the left, but it always fails there because it encounters another comma before reaching the EOL. We can make the regex display its intermediate attempts while backtracking:
`DB<32> ; m/(,[^,]*) (?{say $1}) $/x # show all intermediate attempts to match $1 until it doesn't fail`
Output, one attempt per line: `,B` `,` `,C` `,` `,D` `,` `,E`
The difference with the non-greedy quantifier `*?` is that the engine goes from shortest to longest attempts while backtracking:
`DB<34> ; m/(,[^,]*?) (?{say $1}) $/x # show all intermediate attempts to match $1`
Output, one attempt per line: `,` `,B` `,` `,C` `,` `,D` `,` `,E`
Is this clearer now? :) HTH! Cheers Rolf (addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)
Re^4: Non-greedy substitution by Bod (Parson) on Nov 21, 2024 at 23:14 UTC
Re^5: Non-greedy substitution by LanX (Saint) on Nov 22, 2024 at 15:31 UTC
Re^2: Non-greedy substitution by Bod (Parson) on Nov 15, 2024 at 21:46 UTC
Am I right that `\K` keeps everything up to that point and does not substitute that part? Is it everything to the left, and not just the 'thing' immediately before `\K`?
Re^3: Non-greedy substitution by Corion (Patriarch) on Nov 15, 2024 at 21:53 UTC
If you're interested in seeing how the regex engine steps through a string, you can load Regexp::Debugger. Alternatively, the first edition of Mastering Regular Expressions explains in depth how Perl's RE engine matches and compares that to other engines.
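On the `\K` question above, a minimal sketch may help. It assumes Perl 5.10 or later (where `\K` is available); the variable names `$list`, `$without`, and `$with` are illustrative and not from the thread. It compares the same substitution with and without `\K` to show that everything matched to the left of `\K` is kept, not just the item immediately before it.

```perl
use strict;
use warnings;

my $list = 'A, B, C';

# Without \K the whole match ("A, B,") is replaced by " and".
(my $without = $list) =~ s/.*,/ and/;

# With \K, ".*" still has to match, but what it matched is kept;
# only the text matched after \K (the comma) is replaced.
(my $with = $list) =~ s/.*\K,/ and/;

print "without \\K: '$without'\n";
print "with \\K:    '$with'\n";
```

The first result loses the leading items, while the second keeps them and swaps only the last comma.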
Re: Non-greedy substitution by tybalt89 (Monsignor) on Nov 15, 2024 at 20:42 UTC
Just because I've never seen 'reductions' used before :) `#!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11162727 use warnings; use List::Util qw( reductions ); use Data::Dump 'dd'; dd map s/.*\K,/ and/r, reductions { "$a, $b" } 'A'..'Z';` Outputs: ( "A", "A and B", "A, B and C", "A, B, C and D", "A, B, C, D and E", "A, B, C, D, E and F", "A, B, C, D, E, F and G", "A, B, C, D, E, F, G and H", "A, B, C, D, E, F, G, H and I", "A, B, C, D, E, F, G, H, I and J", "A, B, C, D, E, F, G, H, I, J and K", "A, B, C, D, E, F, G, H, I, J, K and L", "A, B, C, D, E, F, G, H, I, J, K, L and M", "A, B, C, D, E, F, G, H, I, J, K, L, M and N", "A, B, C, D, E, F, G, H, I, J, K, L, M, N and O", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O and P", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P and Q", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q and R", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R and S", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S and T", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T and U", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U and V", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V and W", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W and X", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X and Y", "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y and Z", )
Re: Non-greedy substitution by ikegami (Patriarch) on Nov 15, 2024 at 19:19 UTC
«`,`» matches a comma, then «`.+?`» matches the least possible, then «`$`» matches at the end of the string or before a LF at the end of the string.
`01234567` (position)
`A, B, C`
Full:
Start matching at position 0. At position 0, «`,`» doesn't match. ⇒ Backtrack.
Start matching at position 1. At position 1, «`,`» matches 1 character.
At position 2, «`.+?`» matches 1 character. At position 3, «`$`» doesn't match. ⇒ Backtrack.
At position 2, «`.+?`» matches 2 characters. At position 4, «`$`» doesn't match. ⇒ Backtrack.
At position 2, «`.+?`» matches 3 characters. At position 5, «`$`» doesn't match. ⇒ Backtrack.
At position 2, «`.+?`» matches 4 characters. At position 6, «`$`» doesn't match. ⇒ Backtrack.
At position 2, «`.+?`» matches 5 characters. At position 7, «`$`» matches 0 characters. ⇒ Success.
Summary:
Starts matching at position 1. At position 1, «`,`» matches 1 character. At position 2, «`.+?`» matches 5 characters. At position 7, «`$`» matches 0 characters.
If «`.+?`» were to match any less, the «`$`» wouldn't match.
Solution: `sub join_list { return "none" if !@_; # ??? my $last = pop; return $last if !@_; return join( ", ", @_ ) . " and " . $last; }`
Re^2: Non-greedy substitution by Bod (Parson) on Nov 15, 2024 at 19:38 UTC
> Solution: `sub join_list { return "none" if !@_; # ??? my $last = pop; return $last if !@_; return join( ", ", @_ ) . " and " . $last; }`
An interesting solution. However, in my quest to understand what is going on, I tried forcing the match to be non-comma characters and came up with this, which produces the desired behaviour. `perl -e "my $test = join ', ', ('A', 'B', 'C');$test =~ s/,([^,]+?)$/ and$1/; print $test;"` I still don't understand why the original doesn't work. Surely `,.+?$` is the shortest possible match within the string that starts with a comma and ends at the end of the line...
Re^3: Non-greedy substitution by ikegami (Patriarch) on Nov 15, 2024 at 20:00 UTC
Your mental model of what «`.+?`» does is severely flawed. For starters, that model doesn't allow a pattern to have multiple subpatterns that can each match substrings of different lengths. «`.+?`» does not mean "the shortest possible match within the string that starts with a comma". «`.+`» means "one or more non-LF characters, trying in order of decreasing length", and «`.+?`» means "one or more non-LF characters, trying in order of increasing length". Note the lack of any mention of a comma. «`.+?`» doesn't do any checks related to commas. The comma is matched independently.
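The two definitions above can be checked with a small sketch. The test string is the one from the thread; the variable names `$greedy`, `$frugal`, and `$no_anchor` are illustrative only. Both quantifiers start after the first comma, because the comma is matched on its own; they differ only in the order in which lengths are tried.

```perl
use strict;
use warnings;

my $s = 'A, B, C';

# Anchored to the end: both quantifiers must stretch to the end of the
# string, so greedy and frugal capture the same text from the first comma.
my ($greedy) = $s =~ /(,.+)$/;
my ($frugal) = $s =~ /(,.+?)$/;

# Without the anchor, the frugal quantifier stops after one character
# (the space), because nothing forces it to go further.
my ($no_anchor) = $s =~ /(,.+?)/;

print "greedy:    '$greedy'\n";
print "frugal:    '$frugal'\n";
print "no anchor: '$no_anchor'\n";
```

Only the unanchored pattern shows a visible difference, which is why the original substitution replaced the first comma rather than the last.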
Re^4: Non-greedy substitution by Bod (Parson) on Nov 15, 2024 at 21:38 UTC
Re^5: Non-greedy substitution by ikegami (Patriarch) on Nov 18, 2024 at 01:34 UTC
Re^2: Non-greedy substitution by Bod (Parson) on Nov 15, 2024 at 19:27 UTC
> If `.+?` were to match any less, `$` wouldn't match.
I'm sorry, but I don't understand why `.+?` doesn't match 2 characters at position 5. The match has to be tied to the end of the string... doesn't it?
Re^3: Non-greedy substitution by ikegami (Patriarch) on Nov 15, 2024 at 19:55 UTC
There can't be gaps in what matches. «`.+?`» must start matching where «`,`» left off. I added a "full" trace to my post.
Re^3: Non-greedy substitution by Paladin (Vicar) on Nov 15, 2024 at 19:58 UTC
The regex engine prioritizes "leftmost". So it will always find the leftmost place where the entire regex can match.
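A short sketch of the "leftmost" rule, using the built-in `@-` array (where `$-[0]` holds the start offset of the last successful match). The variable names are illustrative. It shows that the frugal pattern still starts at the first comma, whereas excluding commas from the tail makes every earlier start position fail.

```perl
use strict;
use warnings;

my $s = 'A, B, C';

# The frugal quantifier does not move the start: the first comma
# (offset 1) is the leftmost place where the whole pattern can succeed.
$s =~ /,.+?$/ or die "no match";
my $frugal_start = $-[0];

# With [^,]+ the attempt at the first comma fails (it runs into another
# comma before the end), so the leftmost success is the last comma.
$s =~ /,[^,]+$/ or die "no match";
my $last_comma_start = $-[0];

printf "frugal .+? starts at %d, [^,]+ starts at %d\n",
    $frugal_start, $last_comma_start;
```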
Re^4: Non-greedy substitution by ikegami (Patriarch) on Nov 15, 2024 at 20:14 UTC
Re: Non-greedy substitution by LanX (Saint) on Nov 15, 2024 at 22:33 UTC
And here is the classical way to catch the last segment, reusing choroba's tests. `#!/usr/bin/perl use warnings; use strict; use experimental qw( signatures ); sub non_oxford_list($s) { $s =~ s/,([^,]+)$/ and$1/r } use Test::More tests => 3; is non_oxford_list('A'), 'A'; is non_oxford_list('A, B'), 'A and B'; is non_oxford_list('A, B, C'), 'A, B and C';` Cheers Rolf
Re^2: Non-greedy substitution by LanX (Saint) on Nov 16, 2024 at 08:17 UTC
> the classical way
There is more than one way to go old-school: `use warnings; use strict; sub non_oxford_list { my ($s) = @_; my $p = rindex $s, ','; substr ($s,$p,1) = " and" if $p > -1; return $s; } use Test::More tests => 3; is non_oxford_list('A'), 'A'; is non_oxford_list('A, B'), 'A and B'; is non_oxford_list('A, B, C'), 'A, B and C';` FWIW: see `rindex` and `substr`. Cheers Rolf
Re: Non-greedy substitution by jwkrahn (Abbot) on Nov 15, 2024 at 23:50 UTC
`$test =~ s/,(.+?)$/ and$1/;` ALWAYS goes to the first comma, and the rest is anchored to the end of the line, so it will always match everything to the end of the line. Without the anchor it would match only the next character after the comma (the space character). There are a few different ways to match the last comma. Each of the following prints `A`, `A and B`, `A, B and C`:
1a: `perl -le'my @x = ( "A", "A, B", "A, B, C" ); for my $test ( @x ) { $test =~ s/(.*),/$1 and/s; print $test; }'`
1b: `perl -le'my @x = ( "A", "A, B", "A, B, C" ); for my $test ( @x ) { $test =~ s/.*\K,/ and/s; print $test; }'`
2a: `perl -le'my @x = ( "A", "A, B", "A, B, C" ); for my $test ( @x ) { $test =~ s/,([^,]*)\z/ and$1/; print $test; }'`
2b: `perl -le'my @x = ( "A", "A, B", "A, B, C" ); for my $test ( @x ) { $test =~ s/,(?=[^,]*\z)/ and/; print $test; }'`
3: `perl -le'my @x = ( "A", "A, B", "A, B, C" ); for my $test ( @x ) { $test = reverse $test; $test =~ s/,/dna /; $test = reverse $test; print $test; }'`
Which one you use will depend on how much data you have to process. Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
Re: Non-greedy substitution by sleet (Pilgrim) on Nov 15, 2024 at 23:54 UTC
You don't need to use a regex here. Here's a function I use frequently: `sub name_join { return '' unless @_; return $_[0] if 1 == @_; return join ', ', @_[0 .. $#_ - 2], "$_[-2] and $_[-1]"; }`
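As a quick check of the edge cases, here is the function above laid out over several lines and called with zero to three items. The calling loop and the sample lists are assumptions added for illustration; the function body is unchanged.

```perl
use strict;
use warnings;

sub name_join {
    return '' unless @_;          # no items: empty string
    return $_[0] if 1 == @_;      # one item: returned as-is
    # All but the last two items, then "X and Y" as the final element.
    # For two items the slice 0 .. -1 is empty, leaving just "X and Y".
    return join ', ', @_[0 .. $#_ - 2], "$_[-2] and $_[-1]";
}

for my $items ( [], ['A'], ['A', 'B'], ['A', 'B', 'C'] ) {
    printf "%d item(s): '%s'\n", scalar @$items, name_join(@$items);
}
```

Because it works on a list rather than an already-joined string, it also copes with items that themselves contain commas, which the regex approaches do not.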