Your comment was what I needed to hear: changing the non-greedy match from /".*?"/ to /"[^"]*?"/ appears to work correctly. The negated character class was the trick.
I'm still a bit confused about why there is such a difference in what is matched, but I'll think about it some more.
Thanks for pointing me in the right direction.
perl -e'
$_ = qq{...\n}
.qq{<a href="foo">foo</a>\n}
.qq{<a href="bar">bar</a>\n}
.qq{...\n};
s!(<a href=")(.*?)(">bar</a>)!$1\[$2]$3!s;
print;
'
One might expect the command above to output
...
<a href="foo">foo</a>
<a href="[bar]">bar</a>
...
but that's wrong. It outputs
...
<a href="[foo">foo</a>
<a href="bar]">bar</a>
...
The pattern says to match
- the start of the string,
- followed by as few characters as possible (implicit leading /.*?/),
- followed by the string '<a href="',
- followed by as few characters as possible,
- followed by the string '">bar</a>'.
Keeping in mind that "as few characters as possible" is zero characters, let's check if the string matches:
- Starting at the beginning of the string,
- Do 0 characters follow? Yes, so try to match the next atom.
- Does the string '<a href="' follow? No, so backtrack.
- Does 1 character follow? Yes, so try to match the next atom.
- Does the string '<a href="' follow? No, so backtrack.
- Do 2 characters follow? Yes, so try to match the next atom.
- Does the string '<a href="' follow? No, so backtrack.
- ...
- Do 4 characters follow? Yes, so try to match the next atom.
- Does the string '<a href="' follow? Yes, so try to match the next atom.
- Do 0 characters follow? Yes, so try to match the next atom.
- Does the string '">bar</a>' follow? No, so backtrack.
- Does 1 character follow? Yes, so try to match the next atom.
- Does the string '">bar</a>' follow? No, so backtrack.
- Do 2 characters follow? Yes, so try to match the next atom.
- Does the string '">bar</a>' follow? No, so backtrack.
- ...
- Do 25 characters follow? Yes, so try to match the next atom.
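- Does the string '">bar</a>' follow? Yes, so try to match the next atom.
- There is no next atom, so the match succeeds. $2 now holds everything from 'foo' in the first href through 'bar' in the second, which is why the brackets land where they do.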
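The negated character class mentioned at the top of the thread avoids this, because [^"]* can never step over the closing quote of the first href. Here is a minimal sketch using the same sample string; the fix_link name is made up for illustration and is not from the original post.

```perl
use strict;
use warnings;

# Same sample text as in the one-liner above.
my $text = qq{...\n}
         . qq{<a href="foo">foo</a>\n}
         . qq{<a href="bar">bar</a>\n}
         . qq{...\n};

# Hypothetical helper: bracket the href of the link whose text is "bar".
# [^"]* cannot cross a double quote, so a match that starts in the
# first link can no longer run on into the second one.
sub fix_link {
    my ($s) = @_;
    $s =~ s!(<a href=")([^"]*)(">bar</a>)!$1\[$2]$3!;
    return $s;
}

print fix_link($text);
```

With this pattern the attempt that starts at the first link fails outright, and the engine moves on until it reaches the second link, so only href="bar" gets the brackets.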
Okay, maybe I'm reading more into your response than I should, but here are two questions:
- Is there any difference between /".+?"/ and /".*?"/? Yes, + matches one or more of the previous pattern and * matches zero or more, but given that all strings seen in the table are more than one character in length, does it make any difference when the pattern is non-greedy?
- is not the regular expression originally quoted non-greedy?
Thanks for any insight shared.
is not the regular expression originally quoted non-greedy?
It is. But what do you expect non-greedy to be? Some people think that non-greedy means "match as short a string as possible", without anything else. But there is just one such string, and that's the empty string.
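A small sketch of that point: the engine takes the leftmost place where the pattern can match, and non-greedy only keeps that particular match short. The sample string and the first_quoted helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: the second quoted string is shorter than the first.
my $s = 'say "hello" or "hi"';

# Hypothetical helper: return the first thing /".*?"/ matches.
sub first_quoted {
    my ($str) = @_;
    my ($m) = $str =~ /(".*?")/;
    return $m;
}

print first_quoted($s), "\n";    # prints "hello" (with the quotes), not "hi"
```

The shorter "hi" is never considered, because a match starting at the first quote already succeeds.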
Non-greedy does not mean, "don't match where you would match otherwise". If a pattern matches with greedy (sub) matches, it will match with non-greedy sub matches. And if a pattern doesn't match with non-greedy sub matches, it will not match with greedy sub matches.
All greedy/non-greedy will do is change $&, it will not change whether or not a pattern matches.
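To make that concrete, here is a sketch that runs four variants against one string. It also bears on the + versus * question: the two differ as soon as an empty quoted string is present. The sample string is an assumption chosen to show this, not data from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample containing an empty quoted string.
my $s = 'a "" b "c"';

my %matched;
for my $pat ('".*"', '".*?"', '".+"', '".+?"') {
    # Record $& (the matched text) when the pattern matches.
    $matched{$pat} = $s =~ /$pat/ ? $& : undef;
    printf "%-6s => %s\n", $pat, $matched{$pat} // '(no match)';
}

# Output:
# ".*"   => "" b "c"
# ".*?"  => ""
# ".+"   => "" b "c"
# ".+?"  => "" b "
```

All four patterns match, so greediness does not decide whether there is a match, only what ends up in $&. The /".+?"/ case is the interesting one: it must consume at least one character, so it swallows the closing quote of the empty string and keeps going to the next quote.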