HTML::Tree problems with UTF-8 Content.

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

So I'm parsing some HTML with HTML::TreeBuilder, my new favourite module:

my $tree = HTML::TreeBuilder->new_from_content($page);
$tree->elementify();

at which point it dies, saying "Parsing of undecoded UTF-8 will give garbage when decoding entities at /Library/Perl/HTML/TreeBuilder.pm line 96.".

Now my first problem is that line 96 is the rather unedifying "$new->parse($whunk);". So, after a certain amount of trial and error, I tracked that down to one of the modules it depends on, HTML::Parser, which tells me that "The solution is to use the Encode::encode_utf8() on the data before feeding it to the $p->parse()".

So I do this:

$page = Encode::encode_utf8($page);
my $tree = HTML::TreeBuilder->new_from_content($page);
$tree->elementify();

But it doesn't seem to make any difference. Same error.

So I guess I have three questions:

1. What's the best way to track down error messages like that? I was completely mystified and there were a lot of other modules involved.
2. What should I be doing to ensure the Parser won't choke on this particular HTML?
3. In the more general case, where some HTML will be UTF-8 and some won't, how do I code? I can't utf-encode all HTML, just in case it's UTF-8, but Parser chokes as soon as it finds the HTML is UTF-8, so how do I create an if-clause which figures that out ahead of time?

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print


Re: HTML::Tree problems with UTF-8 Content. by ikegami (Patriarch) on Jul 16, 2005 at 06:20 UTC
1. What's the best way to track down error messages like that? I was completely mystified and there were a lot of other modules involved.

If the message is being output by `Carp::carp`, try:

perl -MCarp=verbose script.pl

This will cause `carp` to spit out a stack trace.
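To see what the verbose switch changes, here is a minimal sketch using only the core Carp module. The package Demo::Parser and the subs parse_chunk and build_tree are hypothetical stand-ins, not anything from HTML::Parser. Setting $Carp::Verbose inside the script is the in-code equivalent of importing `verbose` on the command line.

```perl
use strict;
use warnings;

{
    # Hypothetical module that complains via carp, the way a parser might.
    package Demo::Parser;
    use Carp;
    sub parse_chunk { carp "something looks wrong"; return }
}

# Hypothetical caller sitting between the script and the module.
sub build_tree { Demo::Parser::parse_chunk(); return }

# Collect warnings instead of printing them straight to STDERR.
my @messages;
local $SIG{__WARN__} = sub { push @messages, $_[0] };

build_tree();                   # default: one line, blamed on the caller

{
    local $Carp::Verbose = 1;   # same effect as -MCarp=verbose
    build_tree();               # now carp includes the call chain
}

print "--- default ---\n", $messages[0];
print "--- verbose ---\n", $messages[1];
```

The default message only names the line that called into the module, while the verbose one lists every frame, which is what you want when the complaint comes from several modules down.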
Re: HTML::Tree problems with UTF-8 Content. by GrandFather (Saint) on Jul 16, 2005 at 06:32 UTC
Can you reduce the HTML content to a minimum that demonstrates the problem and append it to your post so we can try and reproduce the problem?

Perl is Huffman encoded by design.
Re: HTML::Tree problems with UTF-8 Content. by graff (Chancellor) on Jul 16, 2005 at 14:42 UTC
I've had this sort of trouble with the HTML parsing modules too. For reasons that probably make sense for some applications, the parsing strategy seems to be based on pulling fixed-size chunks of bytes from whatever input (even a scalar string, which seems odd) -- and then doing some operations on the chunks where perl 5.8's flexibility ("byte-semantics" vs. "character-semantics") goes awry, and ends up trying to do utf8 operations on a character where the final byte or two fell on the wrong side of a chunk boundary.

If you pass the parser a file handle as input, do not open that file handle with ":utf8" or any other encoding pragma that would cause the data to be converted to utf8 on input (via a PerlIO layer). If you pass it a scalar string, make sure that it is a string that does not have the utf8 flag set. (See the Encode man page about the utf8 flag.)

After the parsing is done, use Encode::decode() on the various pieces of text content if you need to do utf8 character-based stuff with it.

I presume this is a difficult design issue for the HTML parsing modules -- or maybe it's something that would be fairly easy to fix -- but it certainly is a problem.
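A rough sketch of the byte-string handling described above, using only the core Encode module. The helpers as_bytes and is_valid_utf8 are hypothetical names made up for illustration, and the sample string is invented. It shows how to get a string without the utf8 flag, how to test ahead of time whether some bytes are valid UTF-8, and why a chunk boundary in the middle of a character is a problem. Check the documentation for your installed HTML::Parser version before relying on this, since its recommended handling may differ.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if a byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode die on malformed input instead of patching it.
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: return a string that does not have the utf8 flag set.
sub as_bytes {
    my ($str) = @_;
    return utf8::is_utf8($str) ? Encode::encode('UTF-8', $str) : $str;
}

my $chars = "caf\x{e9} \x{263a}";   # character string, utf8 flag on
my $bytes = as_bytes($chars);       # undecoded UTF-8, utf8 flag off
my $head  = substr($bytes, 0, 4);   # cuts the two-byte e-acute in half

printf "characters: %d, bytes: %d\n", length $chars, length $bytes;
printf "whole string valid: %d\n",  is_valid_utf8($bytes);
printf "first 4 bytes valid: %d\n", is_valid_utf8($head);

# Decode afterwards when character semantics are needed again.
my $text = Encode::decode('UTF-8', $bytes);
print "round trip ok\n" if $text eq $chars;
```

The validity check is also one way to write the if-clause asked about in the original question: pure ASCII passes it too, which is harmless, while Latin-1 text with accented characters will normally fail it.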