HTTP::Response decoded_content catch 22

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day bros. I am fetching a web page into $response with LWP::UserAgent. I then execute the following code to get it parsed into a tree:

my $html = $response->decoded_content();
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);

Problem is that one (but not all) of the web sites returns content in UTF-8, and for this I get a warning "Parsing of undecoded UTF-8 will give garbage when decoding entities at foobar.pl line 15."

But wait, I need to have the HTML parsed into a tree before I can find the tag that gives me the charset!

I realize I could extract the raw html and use a regexp or something to hunt for the content type tag before parsing the tree, but I'm wondering if there's a more elegant solution to this.

TIA...Steve

Re: HTTP::Response decoded_content catch 22 by ikegami (Patriarch) on Aug 09, 2008 at 00:41 UTC
I think HTML::TreeBuilder only works on bytes (not characters) since it's made to work on files. I had problems getting it to work with characters and ended up decoding everything I extracted from the tree. The way to go might be

my $html_file = encode( 'ASCII', $response->decoded_content(), Encode::FB_HTMLCREF );
my $tree = HTML::TreeBuilder->new();
$tree->parse($html_file);

and use `utf8::decode` on everything extracted from the tree if necessary. It has the advantage of handling multi-byte encodings even if HTML::Parser doesn't, and one knows which encoding to use when decoding data extracted from the tree.
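To make the effect of the `encode( 'ASCII', ..., Encode::FB_HTMLCREF )` step concrete, here is a minimal sketch using only the core Encode module. The sample string is a made-up stand-in for what `$response->decoded_content()` might return; it is not taken from the page in question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for $response->decoded_content(): a character
# string holding two non-ASCII characters (U+00E9 and U+263A).
my $decoded = "<p>caf\x{e9} \x{263a}</p>";

# Anything ASCII cannot represent is replaced by a decimal HTML
# character reference, so the result consists of ASCII bytes only.
my $ascii_html = encode( 'ASCII', $decoded, Encode::FB_HTMLCREF );

print "$ascii_html\n";    # prints: <p>caf&#233; &#9786;</p>
```

Because the result is plain ASCII, the parser never sees raw UTF-8 bytes; the non-ASCII characters reach it only as entities, which it can decode on its own.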
Re: HTTP::Response decoded_content catch 22 by Your Mother (Archbishop) on Aug 08, 2008 at 23:00 UTC
I tried a few different web sites, utf8 and not, and couldn't get that error to fire with that code. What version of LWP are you using? And can you give a website that triggers the error?
Re^2: HTTP::Response decoded_content catch 22 by cormanaz (Deacon) on Aug 09, 2008 at 15:18 UTC
It's one of Barack Obama's news release pages, for example http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php
Re^3: HTTP::Response decoded_content catch 22 by Your Mother (Archbishop) on Aug 09, 2008 at 15:44 UTC
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new();
my $response = $ua->get("http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php");
# print $response->decoded_content;
my $html = $response->decoded_content();
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
print $tree->as_HTML;

That gives no errors for me either, and it gets and parses the contents fine, both on LWP 5.805 with Perl 5.10 and on LWP 5.814 with Perl 5.8.8. Time to upgrade?
Re^4: HTTP::Response decoded_content catch 22 by ikegami (Patriarch) on Aug 09, 2008 at 15:58 UTC
Re^4: HTTP::Response decoded_content catch 22 by cormanaz (Deacon) on Aug 09, 2008 at 16:28 UTC