Re: Munging Rendered HTML While Preserving Formatting
by ViceRaid (Chaplain) on Jun 28, 2004 at 16:55 UTC
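use strict;
use warnings;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(*DATA);
while ( my $token = $p->get_token() ) {
    # Only text tokens are edited; tags, attributes and comments pass through.
    if ( $token->is_text() ) {
        $token->[1] =~ s/2004/2006/g;
    }
    print $token->as_is;
}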
__DATA__
<html>
<head>
</head>
<body>
<h1 id="2004">Euro 2004 : The English were robbed</h1>
<p>We <strong>will</strong> have revenge in the 2006 World Cup!</p>
<!-- Last edited in 2004 -->
</body>
</html>
The 2004 occurring as an HTML attribute and the 2004 in the comment remain unchanged.
Cheers
ViceRaid
Used H:T:Simple's nice ->is_text() method
|
|
<html>
<head></head>
<body>h<i>e</i>ll<b>o</b></body>
</html>
As idsfa points out, this isn't an easy problem given that the replacement text may not be the same length as the original text. This makes the problem even more interesting to me. I avoid HTML data munging whenever humanly possible, so what are other people doing?
|
|
I had to do something a bit like this. I worked at a typesetting company where the typesetters used XML-like tagging. It wasn't XML, because it had no requirement to be balanced, well formed or anything. They had a long book, marked up like that. I was given a bunch of text documents that matched the 'text' (sans tags) of the book. The text documents had index tags put in like this: <index1235>this is indexed</index1235>. These could overlap, be nested, etc. To make matters worse, the rtf documents were based on an outdated copy of the book text: many corrections and additions had been made to it since. My task was to try and insert the index tags into the correct place in the XML-like text.
So I read in the book file, stripped out every tag, space and punctuation character (these were 'corrected' more often than regular text), and stored each one aside with a note of its position. Then I read through the index file and tried to match strings (100 chars) starting from each index tag against the book text and, if found, added the index tag to the tag list at the match position. Then I put the book file back together again, starting from the back so as not to mess up the character positions.
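The strip, match and reinsert idea can be sketched roughly like this. It is a minimal sketch and not the original script: the sub names (normalize_with_offsets, locate, insert_all) and the sample text are made up for illustration, anything between < and > is assumed to be a tag, it keeps an offset map instead of a list of stripped characters, and it matches the whole snippet rather than a 100-character window.

```perl
use strict;
use warnings;

# Reduce a string to lowercase letters and digits, remembering where each
# kept character sat in the original. Returns (normalized string, offsets).
sub normalize_with_offsets {
    my ($text) = @_;
    my $norm = '';
    my @offsets;
    my $in_tag = 0;
    for my $i (0 .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if    ($ch eq '<') { $in_tag = 1; next }
        elsif ($ch eq '>') { $in_tag = 0; next }
        next if $in_tag || $ch !~ /[A-Za-z0-9]/;
        $norm .= lc $ch;
        push @offsets, $i;
    }
    return ($norm, \@offsets);
}

# Find where $snippet (plain text) begins in $book (tagged text), ignoring
# tags, whitespace, punctuation and case. Returns an offset into $book.
sub locate {
    my ($book, $snippet) = @_;
    my ($book_norm, $offsets) = normalize_with_offsets($book);
    my ($snip_norm) = normalize_with_offsets($snippet);
    return unless length $snip_norm;
    my $pos = index($book_norm, $snip_norm);
    return $pos < 0 ? undef : $offsets->[$pos];
}

# Apply insertions from the back so earlier offsets stay valid.
sub insert_all {
    my ($book, %insert_at) = @_;    # offset => string to insert
    for my $offset (sort { $b <=> $a } keys %insert_at) {
        substr($book, $offset, 0) = $insert_at{$offset};
    }
    return $book;
}

my $book = 'The <it>quick</it>, brown fox -- jumps over the lazy dog.';
my $at   = locate($book, 'brown fox jumps');
print insert_all($book, $at => '<index1>'), "\n";
```

A '>' inside a tag attribute would confuse the tag detection here, which is one reason a real parser is the safer choice for HTML.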
But with HTML that can be parsed it's easier. Some tags are stylistic, and some "semantic" (ok, sort of). While these strings could be considered equivalent:
<i>Apple</i> Juice
<b>Apple </b>Juice
it's unlikely that this would be:
<h1>Apple</h1> <h1>Juice</h1>
So I think I'd only do substitutions within one "semantic" tag. If the strings are variable length, you've got to talk with the client about what to do about formatting tags. I think in real-world situations it's not likely to be a problem. You'd probably get a spec like s/bug/issue/g, and you'd only want to match whole words. Or you'd get a paragraph to replace: s/I have no comment/I refer you to <a href="s@e.com">my solicitor</a>/g. In that case, you may want to match "I have <b>no comment</b>", but you would still use the replacement string intact.
qq
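A whole-word spec like s/bug/issue/g restricted to text could look something like the following. This is only a sketch: replace_in_text is a made-up name, the tag pattern is deliberately naive (it assumes no '>' inside attribute values and no comments or scripts), and it treats every run of text between two tags as its own unit rather than grouping by "semantic" tag.

```perl
use strict;
use warnings;

# Replace whole words in text segments only; tags pass through untouched.
sub replace_in_text {
    my ($html, $from, $to) = @_;
    my @parts = split /(<[^>]*>)/, $html;   # the capture keeps tags in the list
    for my $part (@parts) {
        next if $part =~ /\A</;              # skip tags
        $part =~ s/\b\Q$from\E\b/$to/g;      # whole words only
    }
    return join '', @parts;
}

print replace_in_text(
    '<p class="bug">The bug in <b>debugger</b>: one bug, two bugs.</p>',
    'bug', 'issue',
), "\n";
```

The class attribute, "debugger" and "bugs" are left alone; only the two standalone occurrences of "bug" change.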
|
|
There are problems if the replacement text is longer, shorter, or the same size. If the text is longer, where do you put the extra? If the text is shorter, where do you remove the characters? If the text is the same length, do you break it in the same way?
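To make those questions concrete, here is one possible policy, sketched under my own assumptions: each text segment of the match keeps its original length, and the last segment absorbs whatever is left over. The sub name spread is made up, and the tag split is the naive one (no '>' inside attributes).

```perl
use strict;
use warnings;

# Spread $replacement over the text segments of a match broken up by tags.
# @parts alternates text, tag, text, tag, ... as split with a capture returns.
sub spread {
    my ($replacement, @parts) = @_;
    my @text_idx = grep { $_ % 2 == 0 } 0 .. $#parts;
    for my $i (@text_idx) {
        my $is_last = $i == $text_idx[-1];
        my $take    = $is_last ? length($replacement) : length($parts[$i]);
        # Four-argument substr removes the taken characters from $replacement.
        $parts[$i] = substr($replacement, 0, $take, '');
    }
    return join '', @parts;
}

my @parts = split /(<[^>]*>)/, 'fo<b>o</b>';
print spread($_, @parts), "\n" for 'bar', 'fishstick', 'x';
```

With this policy "bar" becomes ba<b>r</b>, "fishstick" becomes fi<b>shstick</b>, and "x" leaves an empty <b></b> behind. Whether any of those is what the author of the page wanted is exactly the open question.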
This is really only a problem when doing replacements with sentences instead of words. It is pretty unlikely that a word will be split in non-pathological cases. It can be argued that a tag is equivalent to a word break. The problem is actually pretty similar to doing munging across line breaks.
The only sane approach is to do replacements on individual text blocks. It might be possible to do replacements on multiple words, either by using something like XSLT that works on the tree, or by writing a regexp that treats whitespace and elements as word separators. For XML, this would not be too hard. The other hard part is maintaining the tags when doing the substitution.
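The regexp idea might look roughly like this. It is a sketch with made-up names (phrase_pattern, find_phrase) and the same naive notion of a tag as anything between < and >; it only finds the phrase and does not attempt the substitution.

```perl
use strict;
use warnings;

# Build a pattern that matches the words of $phrase in order, allowing any
# mix of whitespace and tags between consecutive words.
sub phrase_pattern {
    my ($phrase) = @_;
    my $sep   = qr/(?:\s|<[^>]*>)+/;
    my @words = map { quotemeta($_) } split ' ', $phrase;
    my $body  = join $sep, @words;
    return qr/\b$body\b/;
}

# Return the matched stretch of HTML, or undef if the phrase is absent.
sub find_phrase {
    my ($html, $phrase) = @_;
    my $re = phrase_pattern($phrase);
    return $html =~ /($re)/ ? $1 : undef;
}

my $html = '<p>I have <b>no</b> <i>comment</i> at present.</p>';
my $hit  = find_phrase($html, 'I have no comment');
print "matched: $hit\n" if defined $hit;
```

Note that the match stops at "comment" and leaves the closing </i> outside it, so replacing the matched stretch blindly would orphan that tag. That is the tag-maintenance problem in miniature.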
Re: Munging Rendered HTML While Preserving Formatting
by ihb (Deacon) on Jun 28, 2004 at 19:32 UTC
Re: Munging Rendered HTML While Preserving Formatting
by idsfa (Vicar) on Jun 28, 2004 at 15:57 UTC
It isn't perl, but I use lynx for this task:
$nicely_formatted = `lynx -dump $URL`;
Updated: never mind, I did not parse the question correctly (see response below). For clarity, are you looking for this sort of transform?
fo<b>o</b> => ba<b>r</b>
If so, see the response below, Re^3: Munging Rendered HTML While Preserving Formatting.
If anyone needs me I'll be in the Angry Dome.
|
|
idsfa,
I am not sure I made the problem clear, as your response at first glance isn't appropriate. The task is to change foo to bar in rendered HTML while keeping the original formatting. What it boils down to is changing foo to bar in the underlying HTML (which may involve embedded tags) so that the rendered HTML looks like you did s/foo/bar/g.
|
|
In general, I don't think you can get there from here. Consider the transform s/foo/fishstick/g. How do you transform the HTML fo<b>o</b>?
Assuming you constrain the replacement to have the same length as the original, something like this would do the job:
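use Regexp::Common;

# Pull the tags out one at a time, remembering where each one sat
# in the tag-free text.
my %tags;
while ( $html =~ s/($RE{balanced}{-parens=>'<>'})// ) {
    $tags{$-[0]} .= $1;
}

$html =~ s/foo/bar/g;

# Reinsert from the back so earlier offsets stay valid.
foreach my $point ( sort { $b <=> $a } keys %tags ) {
    substr( $html, $point, 0 ) = $tags{$point};
}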
For the pathological case of a tag with an attribute containing a '>' -- at this point you know as well as I do that you're into a full HTML parser:
use HTML::Parser;

my $html = '';    # tag-free text, built up by txtpush
my %tags;         # offset in $html => tags to reinsert there

# Remove the s///g from this one to leave tags alone.
# Alternately, specify additional handlers to alter only
# specific token types.
sub tagpush { local $_ = shift; s/foo/bar/g; $tags{length($html)} .= $_; }
sub txtpush { $html .= "@_"; }

my $p = HTML::Parser->new(
    unbroken_text => 1,
    text_h        => [ \&txtpush, "text" ],
    default_h     => [ \&tagpush, "text" ],
);
my $file = shift || die "usage: $0 file\n";
$p->parse_file($file) || die "Can't open file $file: $!\n";

$html =~ s/foo/bar/g;

# Reinsert from the back so earlier offsets stay valid.
foreach my $point ( sort { $b <=> $a } keys %tags ) {
    substr( $html, $point, 0 ) = $tags{$point};
}
This last one handles cases like f<!-- -->oo, f<b>oo</b> and <img alt=">foo"> properly as well, which a token parser will not catch.
Re: Munging Rendered HTML While Preserving Formatting
by PodMaster (Abbot) on Jun 29, 2004 at 09:06 UTC
I'd be tempted to solve this using a browser (createTextRange, findText, pasteHTML), but for Perl, HTML::TreeBuilder (a DOM approach) would be a good choice.
The basic approach is that you create a tree out of the HTML, and then scan it for text which you then try to match ... basically you'd implement TextRanges in Perl (without all the rendering-related stuff of course).
update: I should note that HTML::Tree doesn't preserve the formatting of its input exactly, but that's not necessarily a bad thing.
Getting started is as simple as:
use strict;
use warnings;
use HTML::TreeBuilder;
my $body = HTML::TreeBuilder->new_from_content(
    'h<b>e</b>l<i>lo</i>!!!'
)->find_by_tag_name('body');
if ( $body->as_text =~ /hello!!!/ ) {
    print $_, $/ for $body->content_list;
}
__END__
h
HTML::Element=HASH(0x1a540e0)
l
HTML::Element=HASH(0x1a54140)
!!!
Hopefully that'll help you see the forest :)
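The "TextRanges in Perl" part comes down to mapping a match on the joined text back onto the individual text nodes. A rough sketch of that bookkeeping follows, using plain strings in place of tree nodes. The sub name match_ranges is made up, and the segment list is flattened by hand here; collecting it from a real tree would be a separate walk.

```perl
use strict;
use warnings;

# Given text segments in document order, report which slice of each segment
# a match on the joined text covers.
# Returns a list of [segment index, offset within segment, length].
sub match_ranges {
    my ($pattern, @segments) = @_;
    my $text = join '', @segments;
    return unless $text =~ $pattern;
    my ($start, $end) = ($-[0], $+[0]);

    my @ranges;
    my $seg_start = 0;
    for my $i (0 .. $#segments) {
        my $seg_end = $seg_start + length($segments[$i]);
        # Intersect the match [start, end) with this segment's span.
        my $from = $start > $seg_start ? $start : $seg_start;
        my $to   = $end   < $seg_end   ? $end   : $seg_end;
        push @ranges, [ $i, $from - $seg_start, $to - $from ] if $from < $to;
        $seg_start = $seg_end;
    }
    return @ranges;
}

# The text pieces of 'h<b>e</b>l<i>lo</i>!!!' in document order.
my @segments = ('h', 'e', 'l', 'lo', '!!!');
for my $r (match_ranges(qr/ello/, @segments)) {
    printf "segment %d: offset %d, length %d\n", @$r;
}
```

For /ello/ this reports segments 1, 2 and 3, which tells you which nodes a replacement would have to touch.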
| MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | | ** The third rule of perl club is a statement of fact: pod is sexy. |
Re: Munging Rendered HTML While Preserving Formatting
by BbTrumpet (Acolyte) on Jun 30, 2004 at 16:35 UTC
And if you thought it was a mess so far, how about this possible HTML source when you want to replace all instances of the word "his" with the word "her" (or "this" with "that"):
<p><img src="letter_t.gif" align="left">his is a test.
It is only a test.
If this had been an actual emergency, yadda yadda yadda....</p>
(a solution) Re: Munging Rendered HTML While Preserving Formatting
by PodMaster (Abbot) on Jul 22, 2004 at 06:46 UTC
use strict;
use warnings;
use HTML::HiLiter;
my $hiliter = HTML::HiLiter::->new;
$hiliter->Queries([
'foo',
'bar',
'"some phrase"',
],
);
$hiliter->CSS;
$hiliter->Run(\q~
<html>
<title>hi</title>
<style type="text/css">
.hilite2, .hilite1 { /* so you can see whats hilited */
color: red !;
}
</style>
<body>hi there I say <b>f<i>o</i>o</b> there <tt>some phrase</tt>
</body></html>
~);
__END__
<html>
<title>hi</title>
<style type="text/css">
.hilite2, .hilite1 { /* so you can see whats hilited */
color: red !;
}
</style>
<body>hi there I say <b><span class='hilite2'>f</span><i><span class='hilite2'>o</span></i><span class='hilite2'>o</span></b> there <tt><span class='hilite1'>some phrase</span></tt>
</body></html>