VineMob has asked for the wisdom of the Perl Monks concerning the following question:
I am having trouble with HTML::TreeBuilder and utf-8 encoding/decoding, particularly with the ’ character.
Here is a stripped down version of my code:
UPDATE: code modified to correct a syntax error.
require HTTP::Request;
require LWP::UserAgent;
use HTML::Entities;
use HTML::TreeBuilder;
use Encode;

$ua = LWP::UserAgent->new;

# Get the page
$request = HTTP::Request->new("GET", "http://buyingguide.winemag.com/catalog/peju-1998-reserve-cabernet-sauvignon-napa-rutherford");
$response = $ua->request($request);
$body = $response->content();

#dump the file
open (DMP, ">", "dumpfile.html");
print DMP $body;
close DMP;

#parse it
$root = HTML::TreeBuilder->new;
$root->parse($body);
$root->eof;

$review_et = $root->look_down('itemprop','reviewBody');
print $review_et->as_HTML . "\n";
$review = $review_et->as_text;
print $review . "\n";
When I view the webpage in a browser, it contains the string "many ’98 Cabs". That same string shows up in the source for the page, so it's not encoded in the source. The string shows up in dumpfile.html as well. But after parsing with HTML::TreeBuilder, as_HTML prints it as "many ’98 Cabs" and as_text prints it as "many Γاض98 Cabs".
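Symptoms like these usually mean that raw bytes are being treated as characters somewhere. Here is a minimal sketch of that effect using only the core Encode module. It assumes the page is served as UTF-8, which is not confirmed here, so that ’ (U+2019) arrives as the three bytes E2 80 99. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the page is UTF-8, so U+2019 arrives as the bytes E2 80 99.
my $bytes = "many \x{E2}\x{80}\x{99}98 Cabs";

# Undecoded, Perl sees one character per byte.
my $byte_len = length $bytes;

# Decoded, the three bytes collapse into the single character U+2019.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

# Encoding the undecoded string again treats E2, 80 and 99 as three separate
# Latin-1 characters and writes each as two bytes (double encoding).
my $double     = encode('UTF-8', $bytes);
my $double_len = length $double;

printf "bytes: %d, characters: %d, double-encoded bytes: %d\n",
    $byte_len, $char_len, $double_len;
printf "decoded quote is U+%04X\n", ord substr($chars, 5, 1);
```

The three-character garbage in place of one quote mark is consistent with the undecoded or double-encoded cases; which glyphs actually appear depends on the terminal's code page.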
Setting or unsetting utf8_mode doesn't solve the problem; in fact, setting it seems to make things worse. I've tried explicitly setting STDOUT to UTF-8 via binmode, but that doesn't help either. Calling encode_utf8 or decode_utf8 before printing is also no help. I've seen several other questions here about TreeBuilder and UTF-8 similar to mine, but the solutions there have not solved my problem.
What am I not getting?
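One approach that often helps in this situation is to decode the bytes into characters before parsing and to encode again only on output. Here is a sketch under stated assumptions: the page is UTF-8, and the markup in $body is a hypothetical stand-in for the fetched page, not the real one. With LWP, calling $response->decoded_content instead of $response->content performs the decoding step using the charset declared by the response.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder;

# Hypothetical stand-in for the raw bytes returned by $response->content.
my $body = qq[<html><body><p itemprop="reviewBody">]
         . qq[many \x{E2}\x{80}\x{99}98 Cabs</p></body></html>];

# Decode bytes to characters before handing them to the parser.
# Assumption: the page is UTF-8.
my $html = decode('UTF-8', $body);

my $root = HTML::TreeBuilder->new;
$root->parse($html);
$root->eof;

my $review_et = $root->look_down('itemprop', 'reviewBody');
my $review    = $review_et->as_text;

# Encode characters back to UTF-8 bytes only at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print $review, "\n";

$root->delete;
```

The point of the design is that the tree only ever holds decoded characters, so utf8_mode stays off and no encode_utf8 or decode_utf8 calls are needed between parsing and printing.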
Re: TreeBuilder and encoding
by Anonymous Monk on Jul 15, 2013 at 03:19 UTC
by VineMob (Initiate) on Jul 15, 2013 at 13:53 UTC
by Anonymous Monk on Jul 15, 2013 at 23:24 UTC

Re: TreeBuilder and encoding
by Anonymous Monk on Jul 15, 2013 at 02:41 UTC

Re: TreeBuilder and encoding
by 2teez (Vicar) on Jul 15, 2013 at 08:52 UTC
by VineMob (Initiate) on Jul 15, 2013 at 13:59 UTC
by Khen1950fx (Canon) on Jul 15, 2013 at 23:08 UTC

Re: TreeBuilder and encoding
by VineMob (Initiate) on Jul 16, 2013 at 01:55 UTC