tunesmith has asked for the wisdom of the Perl Monks concerning the following question:

I want to change this data:
<p> blah blah <blockquote> blah </blockquote> blah <p>
into this:
<p> <b>blah blah</b> <blockquote> <b>blah</b> </blockquote> <b>blah</b> <p>
I tried:
s/(<.*?>)(.*?)(<.*?>)/$1<b>$2</b>$3/g;
But that didn't touch the code inside of <blockquote> because the marker had already gone past that tag on the first match.

I switched to lookahead/lookback:

s/(?=(<.*?>)(.*?)(<.*?>))/$1<b>$2</b>$3/g;
But that created extra code; a nested blockquote for example.

(This is all a simplified case of what I'm actually trying to do.) In summary: How do I do search/replace if a pattern matches both the end of one and the beginning of another string to replace? Alternative solutions: What is the best way to either tell the "replace" part to only print out chunks of code it hasn't printed before, OR, how can I add additional search criteria to the "search" part without it becoming part of the replacement? Like, "search for this whole string, but only replace this portion of that string with this other replacement string".

Thanks, tunesmith

Re: Regex: How do I use lookahead with search/replace?
by Roger (Parson) on Feb 22, 2004 at 04:18 UTC
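    use strict;

    # Slurp the whole DATA section into one string.
    my $html = do { local $/; <DATA> };

    # The closing tag sits in a lookahead, so it is not consumed and can
    # serve as the opening tag of the next match. Without /s, .* stops at
    # each newline, so each line of text is wrapped separately.
    $html =~ s/(<[^>]+>\s*)(.*)(?=\s*<[^>]+>)/$1<b>$2<\/b>/gm;
    print "$html\n";

    __DATA__
    <p>
    blah blah
    <blockquote>
    blah
    </blockquote>
    blah
    <p>

    And the output -

    <p>
    <b>blah blah</b>
    <blockquote>
    <b>blah</b>
    </blockquote>
    <b>blah</b>
    <p>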
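
To see why moving the closing tag into a lookahead helps, here is a minimal sketch that only collects the text runs instead of rewriting them. The helper names and the sample string are made up for illustration; they are not from the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: the opening AND closing tags are both consumed,
# so a tag that closes one match cannot open the next one.
sub text_runs_consuming {
    my ($html) = @_;
    return $html =~ /<[^>]+>([^<]+)<[^>]+>/g;
}

# Hypothetical helper: the closing tag is only checked in a lookahead,
# so the next match can start on that same tag.
sub text_runs_lookahead {
    my ($html) = @_;
    return $html =~ /<[^>]+>([^<]+)(?=<[^>]+>)/g;
}

my $html = '<p>one<blockquote>two</blockquote>three<p>';
print join(',', text_runs_consuming($html)), "\n";   # one,three
print join(',', text_runs_lookahead($html)), "\n";   # one,two,three
```

The consuming version skips "two" for the reason given in the question: the match for "one" has already eaten the blockquote tag.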

Re: Regex: How do I use lookahead with search/replace?
by graff (Chancellor) on Feb 22, 2004 at 06:26 UTC
    (This is all a simplified case of what I'm actually trying to do.)
    If you are actually trying to do something more complicated than this, and if everything you're doing involves HTML or XML markup, then for the sake of your own sanity (and that of anyone else who might look at or work with your code), use an appropriate parser module (HTML::TokeParser, XML::Parser, or simple variants thereof) -- you'll see this advice repeated often at the monastery, and with good reason. Once you grok using one of these modules, the job will be easier, and there won't be any risk of duplicating tags by mistake with a regex.

    update: I thought about posting some sample code, but decided not to, because you say the given example isn't exactly what you're really trying to do, and because the given example in itself doesn't really make sense -- it would seem more sensible if the intended output looked like this:

    <p> <b> blah blah <blockquote> blah </blockquote> blah </b> </p>
    or like <b> <p> ... </p> </b> -- and when you use a parser module, this is the sort of modification that would be both natural and trivial to do.
      Thanks for the detail - for context, here is exactly what I'm trying to do.

      I am doing revision control for html documents. I am getting diffs (in unified format) from the rcs files. I want to communicate these diffs in html format using the <ins></ins>  <del></del> html tags.

      Those tags are strange in html because if they are started within a block-level html tag, they stop working if other block-level elements are inside.

      So,

      <ins> <p>blah blah<blockquote>blah</blockquote>blah<p> </ins>
      will not render properly; meaning the style elements given to <ins> will not continue until the end of the <ins> block.

      Also,

      <p><ins>blah blah<blockquote>blah</blockquote>blah</ins><p>
      will not render either. My best idea for a workaround is:
      <p><ins>blah blah</ins> <blockquote><ins>blah</ins></blockquote> <ins>blah</ins> <p>
      (Note that some of the html has just <p> tags, while other html has the <p></p> pairings, or </p>)

      So I'm basically trying to take segments of html, which might be awkward segments like blah blah</i><b>blah blah and put in the <ins></ins>  <del></del> tags where appropriate.

        Those tags are strange in html because if they are started within a block-level html tag, they stop working if other block-level elements are inside.

        I wonder if this might be a function of the particular browser you're using to display the content. I tried an example like the following in a Mozilla Firebird browser just now (on Mac OS X), and the "ins" and "del" segments ran their full extent, spanning and including the blockquote contents:

        <html>
        <P>
        Starting here
        <ins> this is inserted text
        <blockquote>
        This is a block quote that is part of the inserted text.
        </blockquote>
        this is the end of the inserted text.
        </ins>
        the original paragraph goes into details which get deleted.
        <del> like this stuff here
        <blockquote>
        and the stuff in this blockquote is also deleted
        </blockquote>
        and this stuff too.
        </del>
        That leaves this part in.
        </html>

        Now, I could imagine cases where the extent of an "ins" or "del" block might run afoul of the markup hierarchy -- e.g. if such a region started outside a blockquote and ended inside it:

        <P> some text
        <del> some deleted text
        <!-- need to add end-del here -->
        <blockquote>
        <!-- need to add start-del here -->
        deleted portion of quote
        </del>
        retained portion of quote
        </blockquote>
        retained portion of paragraph.
        </p>

        Cases like this can be handled pretty well with HTML::TokeParser, given that you know what you need to look for. You could step through the data one "token" at a time, determine what sort of token it is (open tag, close tag, text content, comment) and throw in the extra tags where they're needed.
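
As a rough sketch of that token-by-token idea, the following uses a plain split as a stand-in for a real tokenizer, so that it runs with core Perl only. The helper name wrap_text_runs is hypothetical, and the split will misread comments or a '>' inside an attribute value, which is exactly where HTML::TokeParser would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every non-blank text run in the given element
# (for example "ins" or "del"), leaving the tags themselves untouched.
sub wrap_text_runs {
    my ($html, $wrap) = @_;
    my $out = '';

    # The capturing group makes split return the tags as tokens too.
    for my $token (split /(<[^>]*>)/, $html) {
        if ($token =~ /^</) {
            $out .= $token;                     # a tag: copy it through
        }
        elsif ($token =~ /^(\s*)(.*?)(\s*)$/s && length $2) {
            $out .= "$1<$wrap>$2</$wrap>$3";    # text: wrap it, keep outer whitespace
        }
        else {
            $out .= $token;                     # empty or whitespace only
        }
    }
    return $out;
}

print wrap_text_runs('<p>blah blah<blockquote>blah</blockquote>blah<p>', 'ins'), "\n";
# <p><ins>blah blah</ins><blockquote><ins>blah</ins></blockquote><ins>blah</ins><p>
```

Because the wrapper is closed before every tag and reopened after it, the result never straddles a block-level element, whether or not the source uses closing </p> tags.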

        (This time, I'm not posting code because I think others, like jeffa, could do it better and quicker than I could, and because it's late and I should go to bed. But now it is an interesting problem, and I'd be curious to know what you're actually doing to get from the RCS diff data to the placement of "ins" and "del" tags...)

Re: Regex: How do I use lookahead with search/replace?
by revdiablo (Prior) on Feb 23, 2004 at 01:17 UTC

    Here's a simple regex solution I came up with. It's off the cuff and has a few quirks, but it might give you some ideas:

    local $_ = do { local $/ = undef; <DATA> };

    s[
        (?<=>)      # start at an '>'
        ([^<]+)     # match all the following non '<' chars
     ]
     [<b>$1</b>]gx;

    print;

    __DATA__
    <p>
    blah blah
    <blockquote>
    blah
    </blockquote>
    blah
    <p>

    And the output is:

    <p><b>
    blah blah
    </b><blockquote><b>
    blah
    </b></blockquote><b>
    blah
    </b><p><b>
    </b>

    You'll notice that the <b> and </b> tags land before and after the newlines in a strange fashion. That's one of the quirks. Another is the <b> at the end that encloses nothing except a newline. Both happen because [^<] matches any character other than '<', including a newline. You can tweak to suit if this is a problem.
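
One possible tweak, offered as a sketch and not as the poster's own fix: keep the lookbehind, but capture the leading whitespace separately and require the text to start and end on a non-space character. The helper name bold_trimmed is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: like the substitution above, but whitespace around
# the text stays outside the <b>, and whitespace-only runs are skipped.
sub bold_trimmed {
    my ($html) = @_;
    $html =~ s{
        (?<= > )                          # just after a '>', which is not consumed
        ( \s* )                           # leading whitespace, kept outside the <b>
        ( [^<\s] (?: [^<]* [^<\s] )? )    # text that starts and ends on a non-space
    }{$1<b>$2</b>}gx;
    return $html;
}

print bold_trimmed("<p>\nblah blah\n<blockquote>\nblah\n</blockquote>\nblah\n<p>\n");
```

With the sample data this should put each <b>...</b> pair tightly around the text on its own line, and the trailing newline after the last <p> is left alone because it contains no non-space character.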