in reply to Re: Re: Regex: How do I use lookahead with search/replace?
in thread Regex: How do I use lookahead with search/replace?

Those tags are strange in html because if they are started within a block-level html tag, they stop working if other block-level elements are inside.

I wonder if this might be a function of the particular browser you're using to display the content. I tried an example like the following in a Mozilla Firebird browser just now (on Mac OS X), and the "ins" and "del" segments ran their full extent, spanning and including the blockquote contents:

<html>
<P> Starting here
<ins> this is inserted text
<blockquote> This is a block quote that is part of the inserted text. </blockquote>
this is the end of the inserted text. </ins>
the original paragraph goes into details which get deleted.
<del> like this stuff here
<blockquote> and the stuff in this blockquote is also deleted </blockquote>
and this stuff too. </del>
That leaves this part in.
</html>

Now, I could imagine cases where the extent of an "ins" or "del" block might run afoul of the markup hierarchy -- e.g. if such a region started outside a blockquote and ended inside it:
<P> some text
<del> some deleted text
<!-- need to add end-del here -->
<blockquote>
<!-- need to add start-del here -->
deleted portion of quote </del>
retained portion of quote
</blockquote>
retained portion of paragraph.
</p>

Cases like this can be handled pretty well with HTML::TokeParser, given that you know what you need to look for. You could step through the data one "token" at a time, determine what sort of token it is (open tag, close tag, text content, comment) and throw in the extra tags where they're needed.
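
To make the token-stepping idea concrete, here is a rough sketch, not tested against real diff output. The function name `split_marks_at_blocks` and the list of block-level tags are my own assumptions; the token layout (`S`, `E`, `T`, `C`, `D`, `PI`) is what HTML::TokeParser's `get_token` returns. It closes an open "ins" or "del" before each block-level tag and reopens it right after, so the marked region never straddles a block boundary.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumed (incomplete) set of block-level tags; extend as needed.
my %BLOCK = map { $_ => 1 } qw(p blockquote div ul ol li pre table);
my %MARK  = map { $_ => 1 } qw(ins del);

# Hypothetical helper: re-emit $html, splitting ins/del at block boundaries.
sub split_marks_at_blocks {
    my ($html) = @_;
    my $p    = HTML::TokeParser->new(\$html) or die "cannot parse input";
    my $out  = '';
    my $open = '';    # name of the currently open ins/del, '' if none

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S' or $type eq 'E') {
            my $tag  = $t->[1];                             # lowercased tag name
            my $text = $type eq 'S' ? $t->[4] : $t->[2];    # tag as written
            if ($MARK{$tag}) {
                # remember whether we are inside an ins/del region
                $open = $type eq 'S' ? $tag : '';
                $out .= $text;
            }
            elsif ($BLOCK{$tag} and $open) {
                # close the mark, emit the block tag, reopen the mark
                $out .= "</$open>$text<$open>";
            }
            else {
                $out .= $text;
            }
        }
        else {
            # text, comment and declaration tokens keep their source text
            # in slot 1; processing instructions keep it in slot 2
            $out .= $type eq 'PI' ? $t->[2] : $t->[1];
        }
    }
    return $out;
}
```

Fed something shaped like the overlapping example above, this should put a `</del>` just before `<blockquote>` and a fresh `<del>` just after it. It can leave empty `<del></del>` pairs behind, and it assumes "ins" and "del" regions are never nested, so treat it as a starting point only.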

(This time, I'm not posting code because I think others, like jeffa, could do it better and quicker than I could, and because it's late and I should go to bed. But now it is an interesting problem, and I'd be curious to know what you're actually doing to get from the RCS diff data to the placement of "ins" and "del" tags...)

Re: Re: Re: Re: Regex: How do I use lookahead with search/replace?
by tunesmith (Initiate) on Feb 22, 2004 at 09:50 UTC
    It's interesting that it works in Mozilla. It doesn't in Safari, and I had assumed it wouldn't work elsewhere either, because I was going by what style-sheets.com says about the "ins" and "del" tags: they can be inline or block elements, but "should not be used" as block elements if they are inside other block elements. Since blockquote is a block element, and your example has "ins" inside another block element, it could be that Mozilla simply chooses to handle this case gracefully.

    Here are the details of what I did since you're curious. I am using Rcs::Agent to get diffs in the unified format. I think it's one of the best Rcs libraries. (The only thing that would be better for me would be a library that would allow me to get all lines of a file rather than just the 3 contextual lines surrounding the change.)

    Then among other things, I:

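    foreach (@$diff) {
        if (m/^\@\@(.*)\@\@$/) {
            # hunk header: show the line-range info as a revision marker
            $str .= qq{<p id="rev">$1</p>};
        }
        elsif (m/^\-(.*)$/) {
            push @$old, $1;    # line removed from the old version
        }
        elsif (m/^\+(.*)$/) {
            push @$new, $1;    # line added in the new version
        }
        else {
            # context line: flush the pending removed/added lines, then emit it
            my $tst = $_;
            $str .= getHTMLDiff($old, $new) . $tst;
            undef $old;
            undef $new;
        }
    }
    # flush whatever is still pending after the last line
    $str .= getHTMLDiff($old, $new);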
    getHTMLDiff basically passes $old and $new to HTML::Diff, a nifty little library in CPAN that returns word-level changes and has a few html-friendly features.
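
    For anyone following along, here is roughly the shape such a wrapper could take. This is a sketch and not the actual getHTMLDiff: the names `html_diff_sketch` and `chunks_to_html` are made up for illustration, and the choice to join the lines with newlines is an assumption. It relies on HTML::Diff's `html_word_diff`, which returns a reference to a list of `[flag, left, right]` triples where the flag is `u` (unchanged), `-` (deleted), `+` (inserted) or `c` (changed).

    ```perl
    use strict;
    use warnings;
    use HTML::Diff;    # exports html_word_diff

    # Hypothetical: turn html_word_diff-style triples into ins/del markup.
    sub chunks_to_html {
        my ($chunks) = @_;
        my $html = '';
        for my $chunk (@$chunks) {
            my ($flag, $left, $right) = @$chunk;
            if    ($flag eq 'u') { $html .= $left }
            elsif ($flag eq '-') { $html .= "<del>$left</del>" }
            elsif ($flag eq '+') { $html .= "<ins>$right</ins>" }
            elsif ($flag eq 'c') { $html .= "<del>$left</del><ins>$right</ins>" }
        }
        return $html;
    }

    # Hypothetical stand-in for getHTMLDiff: $old and $new are array refs
    # of removed and added lines (either may be undef).
    sub html_diff_sketch {
        my ($old, $new) = @_;
        my $left  = join "\n", @{ $old || [] };
        my $right = join "\n", @{ $new || [] };
        return '' unless length $left or length $right;
        return chunks_to_html( html_word_diff($left, $right) );
    }
    ```

    Keeping the markup step separate from the diff call makes it easy to try the ins/del output on hand-built triples before feeding it real diff data.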

    So that's the tangent. If jeffa or you or others have more suggestions I'm happy and thankful to read them; in the meantime I'll go check out HTML::TokeParser. Thanks. :-)

    PS You can see the in-progress results of this over at my weblog http://www.museworld.com/ by looking for entries on the front page that have a little "Revision" link at the bottom. I'm basically writing a Movable Type plugin to allow people to keep revision history for their weblog entries.