cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day bros. I am fetching a web page into $response with LWP::UserAgent. I then execute the following code to get it parsed into a tree:
    my $html = $response->decoded_content();
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
Problem is that one (but not all) of the web sites returns content in UTF-8, and for this I get a warning "Parsing of undecoded UTF-8 will give garbage when decoding entities at foobar.pl line 15."

But wait, I need to have the HTML parsed into a tree before I can find the tag that gives me the charset!

I realize I could extract the raw html and use a regexp or something to hunt for the content type tag before parsing the tree, but I'm wondering if there's a more elegant solution to this.
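For reference, the regexp approach mentioned above might look roughly like the following. This is only a sketch under stated assumptions: sniff_charset is a hypothetical helper (not part of LWP or HTML::TreeBuilder), the sample page is made up, and the ISO-8859-1 fallback is an assumption. It uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: look for a charset declaration in the raw bytes.
sub sniff_charset {
    my ($bytes) = @_;
    # Only scan the first 1024 bytes, where a <meta> declaration normally sits.
    my $head = substr($bytes, 0, 1024);
    return lc $1 if $head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i;
    return;    # nothing declared
}

# Made-up sample page: raw UTF-8 bytes, as they would arrive off the wire.
my $raw = qq{<html><head><meta http-equiv="Content-Type" }
        . qq{content="text/html; charset=UTF-8"></head>}
        . qq{<body>caf\xC3\xA9</body></html>};

my $charset = sniff_charset($raw) || 'ISO-8859-1';   # assumed fallback
my $html    = decode($charset, $raw);                # now a character string
```

A regexp like this is easily fooled (commented-out tags, unusual quoting), which is part of why a cleaner solution is preferable.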

TIA...Steve

Replies are listed 'Best First'.
Re: HTTP::Response decoded_content catch 22
by ikegami (Patriarch) on Aug 09, 2008 at 00:41 UTC
    I think HTML::TreeBuilder only works on bytes (not characters) since it's made to work on files. I had problems getting it to work with characters and ended up decoding everything I extracted from the tree. The way to go might be
    use Encode qw(encode);

    my $html_file = encode( 'ASCII', $response->decoded_content(), Encode::FB_HTMLCREF );
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($html_file);

    and use utf8::decode on everything extracted from the tree if necessary.

    It has the advantage of handling multi-byte encodings even if HTML::Parser doesn't, and one always knows which encoding to use when decoding data extracted from the tree.
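To see what the encode step in this reply does on its own, here is a minimal sketch using only the core Encode module. The sample string is made up; in the reply it would be the result of $response->decoded_content().

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up character string containing two non-ASCII characters.
my $chars = "caf\x{e9} \x{263a}";

# With the FB_HTMLCREF fallback, anything ASCII cannot represent is
# replaced by a decimal HTML character reference instead of being lost.
my $ascii = encode( 'ASCII', $chars, Encode::FB_HTMLCREF );

print "$ascii\n";    # caf&#233; &#9786;
```

The result is pure ASCII bytes, so the parser never sees undecoded UTF-8, and the character references are turned back into characters when it decodes entities.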

Re: HTTP::Response decoded_content catch 22
by Your Mother (Archbishop) on Aug 08, 2008 at 23:00 UTC

    I tried a few different web sites, utf8 and not, and couldn't get that error to fire with that code. What version of LWP are you using? And can you give a website that triggers the error?

      It's one of Barack Obama's news release pages, for example http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php
        my $ua = LWP::UserAgent->new();
        my $response = $ua->get("http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php");
        # print $response->decoded_content;
        my $html = $response->decoded_content();
        my $tree = HTML::TreeBuilder->new;
        $tree->parse($html);
        print $tree->as_HTML;

        That gives no errors for me either; it fetches and parses the content fine, both on LWP 5.805 with Perl 5.10 and on LWP 5.814 with Perl 5.8.8. Time to upgrade?