VineMob has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble with HTML::TreeBuilder and utf-8 encoding/decoding, particularly with the ’ character.

Here is a stripped down version of my code:

UPDATE: code modified to correct a syntax error.

require HTTP::Request;
require LWP::UserAgent;
use HTML::Entities;
use HTML::TreeBuilder;
use Encode;

$ua = LWP::UserAgent->new;

# Get the page
$request = HTTP::Request->new("GET", "http://buyingguide.winemag.com/catalog/peju-1998-reserve-cabernet-sauvignon-napa-rutherford");
$response = $ua->request($request);
$body = $response->content();

#dump the file
open (DMP, ">", "dumpfile.html");
print DMP $body;
close DMP;

#parse it
$root = HTML::TreeBuilder->new;
$root->parse($body);
$root->eof;

$review_et = $root->look_down('itemprop','reviewBody');
print $review_et->as_HTML . "\n";
$review = $review_et->as_text;
print $review . "\n";

When I view the webpage in a browser it contains the string "many ’98 Cabs". That same string shows up in the source for the page, so it's not encoded in the source. The string shows up in dumpfile.html as well. But after parsing with HTML::TreeBuilder, as_HTML prints it as "many ’98 Cabs" and as_text prints it as "many Γاض98 Cabs".

Setting or unsetting utf8_mode doesn't solve the problem; in fact, setting it seems to make things worse. I've tried explicitly setting STDOUT to utf-8 via binmode, but that doesn't help either. Calling encode_utf8 or decode_utf8 before printing is also no help. I've seen several other questions here about TreeBuilder and utf-8 similar to mine, but the solutions there have not solved my problem.

What am I not getting?

Replies are listed 'Best First'.
Re: TreeBuilder and encoding
by Anonymous Monk on Jul 15, 2013 at 03:19 UTC

    What am I not getting?

    Now that I've looked, you also need to read https://metacpan.org/module/HTML::TreeBuilder#parse_file because TreeBuilder is interpreting those UTF-8-encoded bytes as Latin-1.
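
To make the Latin-1 point concrete, here is a minimal sketch using only the core Encode module. It assumes the page encodes ’ (U+2019) as the UTF-8 bytes E2 80 99; the helper name code_points is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes E2 80 99 are the UTF-8 encoding of U+2019 (the curly apostrophe).
my $bytes = "many \xE2\x80\x99" . "98 Cabs";

# Correct reading: decode the bytes as UTF-8, giving one character for the quote.
my $as_utf8 = decode('UTF-8', $bytes);

# Wrong reading: treat every byte as one Latin-1 character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Hypothetical helper: list the code points of a string in decimal.
sub code_points { return join ' ', map { ord } split //, $_[0] }

printf "bytes:   %d\n", length($bytes);
printf "UTF-8:   %d chars, quote is U+%04X\n",
    length($as_utf8), ord(substr($as_utf8, 5, 1));
printf "Latin-1: %d chars, quote became code points %s\n",
    length($as_latin1), code_points(substr($as_latin1, 5, 3));
```

The Latin-1 reading turns the single quote into code points 226, 128 and 153, which is the same sequence that shows up as &acirc;&#128;&#153; in the as_HTML output reported later in this thread.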

    This works

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use autodie;
    use WWW::Mechanize 1.72;
    #~ use HTML::TreeBuilder::XPath;
    #~ use HTML::TreeBuilder::LibXML;

    Main( @ARGV );
    exit( 0 );

    sub xtree {
        local $@;
        if( eval { require HTML::TreeBuilder::LibXML; } ){
            return HTML::TreeBuilder::LibXML->new;
        }
        if( eval { require HTML::TreeBuilder::XPath; } ){
            return HTML::TreeBuilder::XPath->new;
        }
        die "$@ you need to install use HTML::TreeBuilder::XPath or use HTML::TreeBuilder::LibXML\n\n";
    }

    sub Main {
        my $url = shift or die "\n\nUsage: $0 http....\n\n";
        binmode STDOUT, ':encoding(UTF-8)'; ## grr
        my $mech = WWW::Mechanize->new( autocheck => 1 );
        my $tree = xtree();
        $mech->get( $url );
        $tree->parse( $mech->content );
        for my $node ( $tree->findnodes( q{ //span[ @itemprop="reviewBody" ] } ) ){
            print $node->as_HTML, "\n\n";
            print $node->as_text, "\n\n";
            print $node->as_HTML( q{<&>} ), "\n\n";
        }
    }
    __END__

      Now that I've looked, you also need to read https://metacpan.org/module/HTML::TreeBuilder#parse_file because TreeBuilder is interpreting those UTF-8-encoded bytes as Latin-1.

      I had read the Latin-1 issue as having to do with file opening. Since I am passing data to parse() as a string, I thought that wouldn't apply. Thinking about it, that may be a poor assumption, but I do note that the documentation for parse() does not mention charsets at all.

      This works

      Hmmm.... I'll have to spend some time looking at the changes you made.

        You're using content, which returns raw bytes; Mechanize's content returns characters, i.e. the same thing as decoded_content.

        Mechanize is for you :)
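
The bytes-versus-characters difference can be checked without touching the network. The sketch below builds an HTTP::Response by hand; the body and the Content-Type header are assumptions standing in for what the real server sends, and it needs HTTP::Message installed (it ships alongside LWP).

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for the fetched page: a UTF-8 body plus a charset header.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>many \xE2\x80\x99" . "98 Cabs</p>",
);

# content() hands back the raw octets exactly as they came over the wire.
my $bytes = $response->content;

# decoded_content() applies the declared charset and returns characters.
my $chars = $response->decoded_content;

printf "content:         %d units\n", length($bytes);
printf "decoded_content: %d units, quote is U+%04X\n",
    length($chars), ord(substr($chars, 8, 1));
```

Passing the decoded_content string to parse(), rather than the content string, is what gives the parser characters instead of undecoded bytes.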

Re: TreeBuilder and encoding
by Anonymous Monk on Jul 15, 2013 at 02:41 UTC
Re: TreeBuilder and encoding
by 2teez (Vicar) on Jul 15, 2013 at 08:52 UTC

    ..I've tried explicitly setting STDOUT to utf-8 via binmode..
    Actually, your STDOUT should have been set to use ":encoding(UTF-8)" instead of "utf-8". Why?

    ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(UTF-8)" checks that the data is actually valid UTF-8.

    this works for me though:

    use warnings;
    use strict;
    use utf8;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $url = 'http://buyingguide.winemag.com/catalog/peju-1998-reserve-cabernet-sauvignon-napa-rutherford';
    my $browser = LWP::UserAgent->new;
    my $re = $browser->get($url);
    if ( $re->is_success ) {
        my $tree = HTML::TreeBuilder->new;
        $tree->parse( $re->decoded_content );
        $tree->eof();
        binmode STDOUT, ":encoding(UTF-8)";
        my $review_et = $tree->look_down( 'itemprop', 'reviewBody' );
        my $str = $review_et->as_text;
        #print $str,$/; # this also works
        print $review_et->as_HTML;
        $tree->delete;
    }
    else {
        die $re->status_line();
    }
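
For readers unsure what the binmode layer actually does on output, here is a small sketch. It writes to an in-memory filehandle instead of STDOUT (an assumption made only so the result can be inspected), and shows the layer turning one character into three bytes.

```perl
use strict;
use warnings;

# A character string: U+2019 is a single character here, not three bytes.
my $text = "many \x{2019}98 Cabs";

# In-memory handle standing in for STDOUT, with the same layer binmode would add.
my $encoded = '';
open my $out, '>:encoding(UTF-8)', \$encoded
    or die "Cannot open in-memory handle: $!";
print {$out} $text;
close $out or die "Cannot close in-memory handle: $!";

# The layer encoded the characters to UTF-8 bytes on the way out.
my $quote_bytes = join ' ', map { sprintf '%02X', ord } split //, substr($encoded, 5, 3);

printf "characters in: %d, bytes out: %d\n", length($text), length($encoded);
print "bytes for the quote: $quote_bytes\n";
```

The layer can only do this correctly if the string it receives holds decoded characters; if the string holds undecoded bytes, each byte is encoded again as if it were its own character.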

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
      ..I've tried explicitly setting STDOUT to utf-8 via binmode.. Actually, your STDOUT should have been set to use ":encoding(UTF-8)" instead of "utf-8".

      Sorry for being imprecise, I had in fact tried both ":encoding(UTF-8)" and "utf-8".

      this works for me though:

      Hmmm.... it actually doesn't for me, which suggests it may be a platform issue. I'll try it on a few different boxen and let you know how it works out.

        I think that you are working it a little too hard. There is no "utf-8" layer, but there is ":utf8". I always use ":encoding(UTF-8)", just to be safe.

        Here's what I did: If you use the new_from_url method, then it will call LWP::UserAgent for you.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TreeBuilder 5 -weak;

        my $url = 'http://buyingguide.winemag.com/catalog/peju-1998-reserve-cabernet-sauvignon-napa-rutherford';
        # new_from_url fetches and parses the page, so no separate parse call is needed
        my $tree = HTML::TreeBuilder->new_from_url( $url );
        my $review_et = $tree->look_down('itemprop', 'reviewBody');
        binmode STDOUT, ":encoding(UTF-8)";
        print $review_et->as_text;
        $tree->delete;
Re: TreeBuilder and encoding
by VineMob (Initiate) on Jul 16, 2013 at 01:55 UTC

    I'd like to thank everyone for their help. It appears to be a platform issue:

    While as_HTML continues to print many &acirc;&#128;&#153;98 Cabs, as_text prints correctly on Debian and Cygwin. I had originally been using ActiveState Perl, and the issue seems to be confined to that version of Perl.

    Once again, thank you for your help.