in reply to Regex: How do I use lookahead with search/replace?

(This is all a simplified case of what I'm actually trying to do.)
If you are actually trying to do something more complicated than this, and if everything you're doing involves HTML or XML markup, then for the sake of your own sanity (and that of anyone else who might look at or work with your code), use an appropriate parser module (HTML::TokeParser, XML::Parser, or simple variants thereof) -- you'll see this advice repeated often at the monastery, and with good reason. Once you grok using one of these modules, the job will be easier, and there won't be any risk of duplicating tags by mistake with a regex.

update: I thought about posting some sample code, but decided not to, because you say the given example isn't exactly what you're really trying to do, and because the given example in itself doesn't really make sense -- it would seem more sensible if the intended output looked like this:

<p> <b> blah blah <blockquote> blah </blockquote> blah </b> </p>
or like <b> <p> ... </p> </b> -- and when you use a parser module, this is the sort of modification that would be both natural and trivial to do.

Re: Re: Regex: How do I use lookahead with search/replace?
by tunesmith (Initiate) on Feb 22, 2004 at 07:30 UTC
    Thanks for the detail - for context, here is exactly what I'm trying to do.

    I am doing revision control for html documents. I am getting diffs (in unified format) from the rcs files. I want to communicate these diffs in html format using the <ins></ins>  <del></del> html tags.

    Those tags are strange in html because if they are started within a block-level html tag, they stop working if other block-level elements are inside.

    So,

    <ins> <p>blah blah<blockquote>blah</blockquote>blah<p> </ins>
    will not render properly; meaning the style elements given to <ins> will not continue until the end of the <ins> block.

    Also,

    <p><ins>blah blah<blockquote>blah</blockquote>blah</ins><p>
    will not render either. My best idea for a workaround is:
    <p><ins>blah blah</ins> <blockquote><ins>blah</ins></blockquote> <ins>blah</ins> <p>
    (Note that some of the html has just <p> tags, while other html has the <p></p> pairings, or </p>)

    So I'm basically trying to take segments of html, which might be awkward segments like blah blah</i><b>blah blah and put in the <ins></ins>  <del></del> tags where appropriate.

      Those tags are strange in html because if they are started within a block-level html tag, they stop working if other block-level elements are inside.

      I wonder if this might depend on the particular browser you're using to display the content. I just tried an example like the following in Mozilla Firebird (on Mac OS X), and the "ins" and "del" segments ran their full extent, spanning and including the blockquote contents:

      <html> <P> Starting here <ins> this is inserted text <blockquote> This is a block quote that is part of the inserted text. </blockquote> this is the end of the inserted text. </ins> the original paragraph goes into details which get deleted. <del> like this stuff here <blockquote> and the stuff in this blockquote is also deleted </blockquote> and this stuff too. </del> That leaves this part in. </html>
      Now, I could imagine cases where the extent of an "ins" or "del" block might run afoul of the markup hierarchy -- e.g. if such a region started outside a blockquote and ended inside it:
      <P> some text <del> some deleted text <!-- need to add end-del here --> <blockquote> <!-- need to add start-del here --> deleted portion of quote </del> retained portion of quote </blockquote> retained portion of paragraph. </p>
      Cases like this can be handled pretty well with HTML::TokeParser, given that you know what you need to look for. You could step through the data one "token" at a time, determine what sort of token it is (open tag, close tag, text content, comment) and throw in the extra tags where they're needed.

      (This time, I'm not posting code because I think others, like jeffa, could do it better and quicker than I could, and because it's late and I should go to bed. But now it is an interesting problem, and I'd be curious to know what you're actually doing to get from the RCS diff data to the placement of "ins" and "del" tags...)
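
      That said, for anyone who wants a starting point, the token-stepping idea might be sketched roughly as follows. This is only a sketch under assumptions: split_marks is a made-up helper name, the list of block-level tags is an illustrative subset, any attributes on "ins"/"del" are dropped, and inline tags (like "b") that stay open across a block boundary are not handled.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Illustrative subset of block-level tags; extend as needed.
my %BLOCK = map { $_ => 1 } qw(p blockquote div ul ol li pre table);
# The change-marking tags that should end up wrapping inline content only.
my %MARK  = map { $_ => 1 } qw(ins del);

# split_marks() is a hypothetical helper name, not part of any module.
# It closes an open <ins>/<del> before every block-level tag and reopens
# it lazily at the next piece of non-blank content.
sub split_marks {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my $out     = '';
    my $open    = '';   # 'ins' or 'del' while inside such a region
    my $emitted = 0;    # true once the opening marker tag has been written

    while (my $tok = $parser->get_token) {
        my $type = $tok->[0];

        # Text token: ["T", $text, $is_data]
        if ($type eq 'T') {
            my $text = $tok->[1];
            if ($open && !$emitted && $text =~ /\S/) {
                $out .= "<$open>";
                $emitted = 1;
            }
            $out .= $text;
            next;
        }

        # Start/end tokens carry the lowercased tag name in slot 1 and
        # the original source text in the last slot.
        my $tag = ($type eq 'S' || $type eq 'E') ? $tok->[1] : '';

        if ($MARK{$tag}) {
            if ($type eq 'S') {
                $open = $tag;               # remember it, emit it later
            }
            else {
                $out .= "</$open>" if $emitted;
                ($open, $emitted) = ('', 0);
            }
        }
        elsif ($BLOCK{$tag}) {
            # Never let a marker span a block-level boundary.
            if ($emitted) {
                $out .= "</$open>";
                $emitted = 0;
            }
            $out .= $tok->[-1];
        }
        else {
            # Inline tags, comments, declarations: pass through, opening
            # the marker first if an inline start tag begins the content.
            if ($open && !$emitted && $type eq 'S') {
                $out .= "<$open>";
                $emitted = 1;
            }
            $out .= $tok->[-1];
        }
    }
    return $out;
}

print split_marks(
    '<p><ins>blah blah<blockquote>blah</blockquote>blah</ins><p>'
), "\n";
```

      Fed the second "will not render" example from the post above, this should produce the same shape as the proposed workaround: each run of inline text gets its own "ins" pair, and the blockquote tags sit outside them.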

        It's interesting that it works in Mozilla. It doesn't in Safari, and I assumed it wouldn't work elsewhere either, because I was going by what style-sheets.com said about the "ins" and "del" tags: they can be inline or block elements, but "should not be used" as block elements if they are inside other block elements. Since blockquote is a block element, and your example has "ins" inside another block element, it could be that this is just one of those cases Mozilla chooses to handle gracefully.

        Here are the details of what I did since you're curious. I am using Rcs::Agent to get diffs in the unified format. I think it's one of the best Rcs libraries. (The only thing that would be better for me would be a library that would allow me to get all lines of a file rather than just the 3 contextual lines surrounding the change.)

        Then, among other things, I do this:

        foreach (@$diff) {
            if (m/^\@\@(.*)\@\@$/) {
                $str .= qq{<p id="rev">$1</p>};
            }
            elsif (m/^\-(.*)$/) {
                push @$old, $1;
            }
            elsif (m/^\+(.*)$/) {
                push @$new, $1;
            }
            else {
                my $tst = $_;
                $str .= getHTMLDiff($old, $new) . $tst;
                undef $old;
                undef $new;
            }
        }
        $str .= getHTMLDiff($old, $new);
        getHTMLDiff basically passes $old and $new to HTML::Diff, a nifty little library in CPAN that returns word-level changes and has a few html-friendly features.
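
        To give an idea of the shape, here is a rough sketch of what such a getHTMLDiff could look like. The body is an assumption, not the actual code: render_chunks is a made-up helper name, and the sketch relies on html_word_diff returning a reference to a list of [flag, left, right] triples with the flags 'u', '+', '-' and 'c', as described in the HTML::Diff documentation.

```perl
use strict;
use warnings;

# render_chunks() is a hypothetical helper: it takes a reference to a
# list of [flag, left, right] triples and wraps the changed parts in
# <ins>/<del> tags.
sub render_chunks {
    my ($chunks) = @_;
    my $html = '';
    for my $chunk (@$chunks) {
        my ($flag, $left, $right) = @$chunk;
        if    ($flag eq 'u') { $html .= $left }                  # unchanged
        elsif ($flag eq '+') { $html .= "<ins>$right</ins>" }    # added
        elsif ($flag eq '-') { $html .= "<del>$left</del>" }     # removed
        elsif ($flag eq 'c') {                                   # changed
            $html .= "<del>$left</del><ins>$right</ins>";
        }
    }
    return $html;
}

# Assumed shape of getHTMLDiff: join the collected old/new lines and
# hand both strings to HTML::Diff for a word-level comparison.
sub getHTMLDiff {
    my ($old, $new) = @_;
    return '' unless ($old && @$old) || ($new && @$new);
    require HTML::Diff;    # loaded only when there is something to diff
    my $chunks = HTML::Diff::html_word_diff(
        join("\n", @{ $old || [] }),
        join("\n", @{ $new || [] }),
    );
    return render_chunks($chunks);
}
```

        Nothing here stops a chunk from containing a block-level tag, so "ins" and "del" can still end up spanning a blockquote, which is exactly the rendering problem described earlier.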

        So that's the tangent. If jeffa or you or others have more suggestions I'm happy and thankful to read them; in the meantime I'll go check out HTML::TokeParser. Thanks. :-)

        PS: You can see the in-progress results of this over at my weblog, http://www.museworld.com/, by looking for entries on the front page that have a little "Revision" link at the bottom. I'm basically writing a Movable Type plugin to allow people to keep revision history for their weblog entries.