PerlMonks  

Re: TreeBuilder and encoding

by Anonymous Monk
on Jul 15, 2013 at 03:19 UTC ( [id://1044260]=note )


in reply to TreeBuilder and encoding

What am I not getting?

Now that I've looked, you also need to read https://metacpan.org/module/HTML::TreeBuilder#parse_file because TreeBuilder is interpreting those UTF-8-encoded bytes as Latin-1.
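To make the bytes-versus-Latin-1 point concrete, here is a minimal sketch using only the core Encode module. The sample string is made up for illustration and is not from the page being scraped; it only shows what happens when UTF-8 bytes are read one byte per character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "e acute" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";

# Reading each byte as one Latin-1 character yields 5 characters (mojibake).
my $as_latin1 = decode( 'ISO-8859-1', $bytes );

# Decoding the same bytes as UTF-8 yields the intended 4 characters.
my $as_utf8 = decode( 'UTF-8', $bytes );

binmode STDOUT, ':encoding(UTF-8)';
printf "latin-1 reading: %s (%d chars)\n", $as_latin1, length $as_latin1;
printf "utf-8 reading:   %s (%d chars)\n", $as_utf8,   length $as_utf8;
```

The practical consequence is that a string should be decoded to characters before it is handed to a parser that expects characters.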

This works

#!/usr/bin/perl --
use strict;
use warnings;
use autodie;
use WWW::Mechanize 1.72;
#~ use HTML::TreeBuilder::XPath;
#~ use HTML::TreeBuilder::LibXML;

Main( @ARGV );
exit( 0 );

sub xtree {
    local $@;
    if( eval { require HTML::TreeBuilder::LibXML; } ){
        return HTML::TreeBuilder::LibXML->new;
    }
    if( eval { require HTML::TreeBuilder::XPath; } ){
        return HTML::TreeBuilder::XPath->new;
    }
    die "$@ you need to install use HTML::TreeBuilder::XPath or use HTML::TreeBuilder::LibXML\n\n";
}

sub Main {
    my $url = shift or die "\n\nUsage: $0 http....\n\n";
    binmode STDOUT, ':encoding(UTF-8)'; ## grr
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    my $tree = xtree();
    $mech->get( $url );
    $tree->parse( $mech->content );
    for my $node ( $tree->findnodes( q{ //span[ @itemprop="reviewBody" ] } ) ){
        print $node->as_HTML, "\n\n";
        print $node->as_text, "\n\n";
        print $node->as_HTML( q{<&>} ), "\n\n";
    }
}
__END__

Replies are listed 'Best First'.
Re^2: TreeBuilder and encoding
by VineMob (Initiate) on Jul 15, 2013 at 13:53 UTC
    Now that I've looked, you also need to read https://metacpan.org/module/HTML::TreeBuilder#parse_file because TreeBuilder is interpreting those UTF-8-encoded bytes as Latin-1.

    I had read the Latin-1 issue as having to do with file opening. Since I am passing data to parse() as a string, I thought that wouldn't apply. Thinking about it, that may be a poor assumption, but I do note that the documentation for parse() does not mention charsets at all.

    This works

    Hmmm.... I'll have to spend some time looking at the changes you made.

      You're using content, which is bytes; Mechanize's content is characters, i.e. decoded_content.

      Mechanize is for you :)
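The difference between content and decoded_content can be seen without any network access. The sketch below builds an HTTP::Response by hand; the body and the charset header are made-up stand-ins for a fetched page, and it assumes the HTTP::Message distribution (which provides HTTP::Response) is installed.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for a fetched page (hypothetical body).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>caf\xC3\xA9</p>",    # raw UTF-8 bytes, as they would arrive off the wire
);

my $bytes = $res->content;            # undecoded octets
my $chars = $res->decoded_content;    # characters, decoded per the charset header

printf "content:         length %d\n", length $bytes;
printf "decoded_content: length %d\n", length $chars;
```

The two lengths differ because the accented letter is two bytes in the raw body but one character after decoding. Passing the decoded string to parse() is what the script above relies on when it uses Mechanize's content.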
