I am performing a substitution on strings that fall between two anchors. A certain substring -- say, "cd" -- may or may not appear as part of the string I'm matching. If it does, I need to capture it.

In my examples below, the anchors are commas, but in reality they are complex regular expressions, so that I can't just use [^,]* to avoid steamrolling over them.

In essence, the substitution will be some variation on

s/,.*(cd)?.*,/=$1=/

Making the two .* subexpressions nongreedy while keeping the (cd)? greedy would seem to express exactly what I'm trying to do, except for the slight inconvenience that it doesn't work:

echo ,abcdefg,abcdefg | perl -pe 's/,.*?(cd)?.*?,/=$1=/'

I want a captured "cd" between the two equal signs, but the $1 remains empty. The problem seems to be that .*? is apparently not merely nongreedy but also unaccommodating: It won't even consume enough to allow the following greedy subexpression to match. It's not clear to me why a nongreedy expression would consume enough to match a required subexpression (i.e. if I omitted the ? after (cd)), but not an optional but greedy one.

I can make the expression capture the "cd" if I make my first subexpression a little more explicit:

echo ,abcdefg,abcdefg, | perl -pe 's/,(?:(?!cd).)*(cd)?.*?,/=$1=/'

This gives me the desired output of "=cd=abcdefg,". But this fails if the part between the anchors does not contains a "cd":

echo ,abcefg,abcdefg, | perl -pe 's/,(?:(?!cd).)*(cd)?.*?,/=$1=/'

Here, the desired output is "==abcdefg,", but the greedy subexpression ignores the anchor boundary and goes into the section of the string following it to find a "cd".

I've tried various other things but not yet found something that works. How do I get the $1 to be populated with a "cd" if it appears in the string, and remain empty if it doesn't, while staying between the anchors?


In reply to greedy subexpression between two nongreedy ones by raygun

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.