in reply to Re: HTTP::Response decoded_content catch 22
in thread HTTP::Response decoded_content catch 22

It's one of Barack Obama's news release pages, for example http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php

Re^3: HTTP::Response decoded_content catch 22
by Your Mother (Archbishop) on Aug 09, 2008 at 15:44 UTC
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get("http://www.barackobama.com/2008/08/07/obama_talks_about_reviving_eco.php");
    # print $response->decoded_content;
    my $html = $response->decoded_content();
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # tell the parser the document is complete
    print $tree->as_HTML;

    That gives no errors for me either, and it fetches and parses the content fine, both on LWP 5.805 with Perl 5.10 and on LWP 5.814 with Perl 5.8.8. Time to upgrade?

      It still exists in the latest version. The docs say:

      (W) The first chunk parsed appears to contain undecoded UTF-8 and one or more argspecs that decode entities are used for the callback handlers. The result of decoding will be a mix of encoded and decoded characters for any entities that expand to characters with code above 127. This is not a good thing. The solution is to use the Encode::encode_utf8() on the data before feeding it to the $p->parse(). For $p->parse_file() pass a file that has been opened in ":utf8" mode. The parser can process raw undecoded UTF-8 sanely if the utf8_mode is enabled or if the "attr", "@attr" or "dtext" argspecs is avoided.
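A minimal sketch of what that warning asks for, in case it helps. It assumes the point of the quoted text is to hand the parser decoded characters rather than raw bytes, so it calls Encode::decode (the quote names encode_utf8; reading it as "decode first" is my assumption). The helper name parse_utf8_html and the sample markup are made up for illustration. The other option the quote mentions, utf8_mode, is not shown here.

```perl
use strict;
use warnings;
use Encode ();
use HTML::TreeBuilder;

# Hypothetical helper: takes raw UTF-8 bytes and returns a parsed tree.
sub parse_utf8_html {
    my ($bytes) = @_;

    # Turn the UTF-8 byte string into a Perl character string first,
    # so literal text and decoded entities end up in the same form.
    my $chars = Encode::decode('UTF-8', $bytes);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($chars);
    $tree->eof;    # signal end of document
    return $tree;
}

# Sample input: one literal e-acute as UTF-8 bytes, one as an entity.
my $bytes = "<p>caf\xC3\xA9 &eacute;</p>";
my $tree  = parse_utf8_html($bytes);
my $text  = $tree->as_text;
$tree->delete;     # free the tree explicitly
```

If the bytes are parsed without the decode step, the literal character stays as two bytes while the entity becomes one character, which is the mix the warning describes.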

      It could be that the server didn't specify the character encoding of the content.
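One way to check that guess, as a rough sketch: look at whether the Content-Type header carries a charset, and give decoded_content a fallback if it does not. The response below is built by hand as a stand-in for the real page, and treating the content as UTF-8 is an assumption. content_type_charset and the default_charset option come from the HTTP::Headers and HTTP::Message docs; they may not be present in the older LWP releases mentioned in this thread, so check your installed version.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for a server reply whose Content-Type header names no charset.
my $bytes    = "<p>caf\xC3\xA9</p>";
my $response = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], $bytes,
);

# Undefined when the header carries no charset parameter.
my $declared = $response->content_type_charset;

# Assumption: the page is really UTF-8, so use that as the fallback
# when the header does not say otherwise.
my $html = $response->decoded_content( default_charset => 'UTF-8' );

printf "declared charset: %s\n", defined $declared ? $declared : '(none)';
```

With a real request, the same two calls would go on the object returned by $ua->get(...).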

      Well, I upgraded to 5.814 (from 5.805), and that does seem to have fixed the problem. Thanks.