Munging Rendered HTML While Preserving Formatting

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:


Re: Munging Rendered HTML While Preserving Formatting
by ViceRaid (Chaplain) on Jun 28, 2004 at 16:55 UTC

HTML::TokeParser::Simple makes this task very simple:

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(*DATA);

while ( my $token = $p->get_token() ) {
    if ( $token->is_text() ) {
        $token->[1] =~ s/2004/2006/;
    }
    print $token->as_is;
}

__DATA__
<html>
<head>
</head>
<body>
<h1 id="2004">Euro 2004 : The English were robbed</h1>
<p>We <strong>will</strong> have revenge in the 2006 World Cup!</p>
<!-- Last edited in 2004 -->
</body>
</html>

The 2004 occurring as an HTML attribute and the 2004 in the comment remain unchanged.

Cheers
ViceRaid

Used H:T:Simple's nice ->is_text() method


Re^2: Munging Rendered HTML While Preserving Formatting

by Limbic~Region (Chancellor) on Jun 28, 2004 at 17:44 UTC

ViceRaid

HTML::TokeParser::Simple makes this task very simple:

Not really, but I guess it is my fault for not being clear. If you look, this is the same module that I had mentioned that doesn't meet all the requirements.

<html>
<head></head>
<body>h<i>e</i>ll<b>o</b></body>
</html>

idsfa

Cheers - L~R


Re^3: Munging Rendered HTML While Preserving Formatting

by qq (Hermit) on Jun 28, 2004 at 21:08 UTC

I had to do something a bit like this. I worked at a typesetting company where the typesetters used xml-like tagging. It wasn't xml, because it had no requirements to be balanced, well formed or anything. They had a long book, marked up like that.

I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text - many corrections and additions had been made to it. My task was to try to insert the index tags into the correct place in the xml-like text.

So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the book text and, if found, added the index tag to the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.

But with html that can be parsed it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent:

<i>Apple</i> Juice
<b>Apple </b>Juice

it's unlikely that this would be:

<h1>Apple</h1> <h1>Juice</h1>

So I think I'd only do substitutions within one "semantic" tag.

If the strings are variable length, you've got to talk with the client about what to do about formatting tags. I think in real-world situations it's not likely to be a problem. You'd probably get a spec like s/bug/issue/g, and you'd only want to match whole words. Or you'd get a paragraph to replace: s/I have no comment/I refer you to <a href="s@e.com">my solicitor</a>/g. In that case, you may want to match "I have no comment" even when formatting tags fall inside it, but you would still use the replacement string intact.
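
For the whole-word case, a minimal sketch might look like the following. It is not a real parser: it assumes tags are simple (no '>' inside attribute values), and replace_word_in_text is a made-up helper name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: whole-word replacement in text only, tags untouched.
# Assumes a naive tag pattern /<[^>]*>/ (breaks on '>' inside attributes).
sub replace_word_in_text {
    my ($html, $from, $to) = @_;

    # The capturing group makes split keep the tags, so the list
    # alternates: text, tag, text, tag, ...
    my @parts = split /(<[^>]*>)/, $html;

    for my $i (0 .. $#parts) {
        next if $i % 2;    # odd indices are the captured tags
        $parts[$i] =~ s/\b\Q$from\E\b/$to/g;
    }
    return join '', @parts;
}

print replace_word_in_text(
    '<p class="bug">The bug, not the <i>bugbear</i>.</p>', 'bug', 'issue'
), "\n";
```

The class attribute and "bugbear" are left alone; only the standalone word in the text changes. A word that is itself split by a tag would not be matched by this sketch.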


Re^3: Munging Rendered HTML While Preserving Formatting

by iburrell (Chaplain) on Jun 28, 2004 at 19:54 UTC

This is really only a problem when doing replacements with sentences instead of words. It is pretty unlikely that a word will be split in non-pathological cases. It can be argued that a tag is equivalent to a word break. The problem is actually pretty similar to doing munging across line breaks.

The only sane approach is to do replacement on individual text blocks. It might be possible to do replacements on multiple words, either by using something like XSLT that works on the tree, or by writing a regexp that matches whitespace and elements as word separators. For XML, this would not be too hard. The other hard part is maintaining the tags when doing the substitution.
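
A rough sketch of the regexp idea, assuming simple tags with no '>' inside attribute values. The name phrase_regex is invented for illustration, and this only finds the phrase; it does nothing about maintaining the tags afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex for a phrase whose words may be
# separated by any run of whitespace and/or simple tags.
sub phrase_regex {
    my ($phrase) = @_;
    my $sep   = '(?:\s|<[^>]*>)+';                  # naive "word break"
    my @words = map { quotemeta($_) } split ' ', $phrase;
    my $body  = join $sep, @words;
    return qr/$body/;
}

my $re   = phrase_regex('I have no comment');
my $html = '<p>I have <b>no</b> comment</p>';

if ( $html =~ /($re)/ ) {
    print "matched: $1\n";
}
```

Note that a tag in the middle of a word (h<i>e</i>llo) is still not matched, since the separator is only allowed between words.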


Re: Munging Rendered HTML While Preserving Formatting
by ihb (Deacon) on Jun 28, 2004 at 19:32 UTC

I solved a problem of this kind at Re: Matching across newlines without stripping them out. That problem was simpler as it only preserved newlines (or any other characters), whereas you want to preserve tags. I don't recommend using it as it stands, as HTML and regexes usually make a fragile combination, but it may provide some food for thought.

It's kind of the reverse of idsfa's approach in Re^3: Munging Rendered HTML While Preserving Formatting.

Hope this helps,
ihb


Re: Munging Rendered HTML While Preserving Formatting
by idsfa (Vicar) on Jun 28, 2004 at 15:57 UTC

It isn't perl, but I use lynx for this task:

$nicely_formatted = `lynx -dump $URL`;

Update: never mind, I did not parse the question correctly (see response below). For clarity, are you looking for this sort of transform?

fo<b>o</b>  =>  ba<b>r</b>

If so, see the response below Re^3: Munging Rendered HTML While Preserving Formatting

If anyone needs me I'll be in the Angry Dome.


Re^2: Munging Rendered HTML While Preserving Formatting

by Limbic~Region (Chancellor) on Jun 28, 2004 at 16:02 UTC

idsfa

Cheers - L~R


Re^3: Munging Rendered HTML While Preserving Formatting

by idsfa (Vicar) on Jun 28, 2004 at 16:52 UTC

In general, I don't think you can get there from here. Consider the transform s/foo/fishstick/g. How do you transform the HTML foo?

Assuming you constrain the replacement to have the same length as the original, something like this would do the job:

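   use Regexp::Common;

   my %tags;   # offset in the tag-free text => tag(s) removed at that offset

   # Strip one tag per pass; $html is assumed to already hold the document.
   # Earlier tags are gone by the time each match runs, so $-[0] is the
   # tag's offset in the stripped text.
   while( $html =~ s/($RE{balanced}{-parens=>'<>'})// )
   {
      $tags{$-[0]} .= $1;
   }

   $html =~ s/foo/bar/g;

   # Reinsert from the back so earlier offsets stay valid.
   foreach my $point (sort {$b<=>$a} keys (%tags))
   {
      substr($html, $point, 0 ) = $tags{$point};
   }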
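
To try the idea without installing anything, here is a self-contained sketch of the same strip, substitute, reinsert steps. It swaps in a naive /<[^>]*>/ tag pattern instead of Regexp::Common, and munge_same_length is a made-up name; treat it as an illustration under those assumptions, not a robust tool.

```perl
use strict;
use warnings;

# Hypothetical helper: same-length replacement on the text only.
# Assumes simple tags with no '>' inside attribute values.
sub munge_same_length {
    my ($html, $from, $to) = @_;
    die "replacement must have the same length\n"
        unless length($from) == length($to);

    my %tags;
    # Remove the first remaining tag each time and remember where it was.
    while ( $html =~ s/(<[^>]*>)// ) {
        $tags{ $-[0] } .= $1;
    }

    $html =~ s/\Q$from\E/$to/g;

    # Reinsert from the back so earlier offsets stay valid.
    for my $point ( sort { $b <=> $a } keys %tags ) {
        substr( $html, $point, 0 ) = $tags{$point};
    }
    return $html;
}

print munge_same_length( 'fo<b>o</b>', 'foo', 'bar' ), "\n";
```

The same-length check is what keeps the recorded offsets meaningful; with s/foo/fishstick/g there is no obvious place to put the tags back.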

For the pathological case of a tag with an attribute containing a '>' -- at this point you know as well as I do that you're into a full HTML parser:

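   use HTML::Parser;

   my %tags;         # offset in the collected text => markup seen there
   my $html = '';    # text content collected by txtpush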

   # Remove the s///g from this one to leave tags alone
   # Alternately, specify additional methods to alter only
   # specific token types
   sub tagpush {$_ = shift; s/foo/bar/g; $tags{length($html)} .= $_ ;}
   sub txtpush { $html .= "@_";  }

   my $p = HTML::Parser->new(unbroken_text => 1,
        text_h    => [ \&txtpush, "text" ],
        default_h => [ \&tagpush, "text" ],
       );

   my $file = shift || usage();
   $p->parse_file($file) || die "Can't open file $file: $!\n";

   $html =~ s/foo/bar/g;

   foreach my $point (sort {$b<=>$a} keys (%tags))
   {
      substr($html, $point, 0 ) = $tags{$point};
   }

This last one handles cases like foo, foo and <img alt=">foo"> properly as well, which a token parser will not catch.

If anyone needs me I'll be in the Angry Dome.


Re^4: Munging Rendered HTML While Preserving Formatting

by Limbic~Region (Chancellor) on Jun 28, 2004 at 17:47 UTC

Re: Munging Rendered HTML While Preserving Formatting
by PodMaster (Abbot) on Jun 29, 2004 at 09:06 UTC

HTML::TreeBuilder

TextRange

update: I should note that HTML::Tree doesn't preserve the formatting of its input exactly, but that's not necessarily a bad thing. Getting started is as simple as

use strict;
use warnings;
use HTML::TreeBuilder;

my $body = HTML::TreeBuilder->new_from_content(
    'h<b>e</b>l<i>lo</i>!!!'
)->find_by_tag_name('body');

if( $body->as_text   =~ /hello!!!/ ){
    print $_,$/ for $body->content_list;
}

__END__
h
HTML::Element=HASH(0x1a540e0)
l
HTML::Element=HASH(0x1a54140)
!!!

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re: Munging Rendered HTML While Preserving Formatting
by BbTrumpet (Acolyte) on Jun 30, 2004 at 16:35 UTC

<p><img src="letter_t.gif" align="left">his is a test.
It is only a test.
If this had been an actual emergency, yadda yadda yadda....</p>


(a solution) Re: Munging Rendered HTML While Preserving Formatting
by PodMaster (Abbot) on Jul 22, 2004 at 06:46 UTC

HTML::HiLiter

use strict;
use warnings;
use HTML::HiLiter;
my $hiliter = HTML::HiLiter::->new;
$hiliter->Queries([
        'foo',
        'bar',
        '"some phrase"',
    ],
);
$hiliter->CSS;
$hiliter->Run(\q~
<html>
<title>hi</title>
<style type="text/css">
.hilite2, .hilite1 { /* so you can see whats hilited */
    color: red !;
}
</style>
<body>hi there I say <b>f<i>o</i>o</b> there <tt>some phrase</tt>
</body></html>
~);
__END__
<html>
<title>hi</title>
<style type="text/css">
.hilite2, .hilite1 { /* so you can see whats hilited */
    color: red !;
}
</style>
<body>hi there I say <b><span class='hilite2'>f</span><i><span class='hilite2'>o</span></i><span class='hilite2'>o</span></b> there <tt><span class='hilite1'>some phrase</span></tt>
</body></html>
