in reply to Re: Problem displaying unicode for certain websites
Turns out the advice wouldn't have gotten you far.
You know what, it does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either.
That is, unless you set the decode_entities option to false. When you do, you have:
LWP::Simple's get returns decoded HTML (characters, not bytes).
HTML::Strip->parse wants HTML encoded using an ASCII-derived encoding.
HTML::Strip->parse returns similarly encoded text.
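To make the three points above concrete, here's a minimal sketch of the encode / strip / decode round trip. It doesn't use HTML::Strip at all: strip_tags_bytes is a hypothetical stand-in I made up for a byte-oriented stripper, and the sample string is invented. Only utf8::encode and utf8::decode are the real thing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a byte-oriented stripper.
# This is NOT how HTML::Strip works internally; it only marks the place
# where a parser that expects encoded input would sit.
sub strip_tags_bytes {
    my ($bytes) = @_;
    $bytes =~ s/<[^>]*>//g;
    return $bytes;
}

# Decoded HTML: 13 characters, including U+00E9 and U+263A.
my $decoded_html = "<p>caf\x{E9} \x{263A}</p>";

# Characters -> UTF-8 bytes (the copy is encoded in place).
utf8::encode( my $utf8_html = $decoded_html );

# The stripper only ever sees bytes.
my $utf8_text = strip_tags_bytes($utf8_html);

# UTF-8 bytes -> characters again.
utf8::decode( my $decoded_text = $utf8_text );

printf "%d chars in, %d bytes encoded, %d chars out\n",
    length($decoded_html), length($utf8_html), length($decoded_text);
```

The reason UTF-8 is safe here is that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none of them can be mistaken for an ASCII "<", ">" or "&" by code that only understands bytes.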
So here's a workaround to use the doubly-buggy module:
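    use strict;
    use warnings;
    use open ':std', ':locale';

    use LWP::Simple    qw( get );
    use HTML::Strip    qw( );
    use HTML::Entities qw( decode_entities );

    my $url = $ARGV[0];

    # get() returns decoded HTML (characters), or undef on failure.
    defined( my $decoded_html = get($url) )
        or die("Couldn't fetch $url\n");

    # Turn off the module's own entity decoding.
    my $hs = HTML::Strip->new( decode_entities => 0 );

    # parse() wants encoded input and returns encoded output,
    # so encode before the call and decode after it.
    utf8::encode( my $utf8_html = $decoded_html );
    my $utf8_text = $hs->parse( $utf8_html );
    utf8::decode( my $decoded_text = $utf8_text );

    # Decode the entities ourselves, now that we have characters again.
    $decoded_text = decode_entities($decoded_text);

    $decoded_text =~ s/^\s+//;
    print substr($decoded_text, 0, 400);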
I posted a similar program in response to the aforementioned bug report.
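One more note on the output side. The program relies on use open ':std', ':locale' to encode what it prints, which only helps if your locale can represent the characters. Here's a small sketch, assuming you'd rather force UTF-8 explicitly. It writes to an in-memory buffer instead of STDOUT so the result can be inspected, and the sample string is made up.

```perl
use strict;
use warnings;

# Decoded text: 7 characters, including U+00EF and U+2603.
my $decoded_text = "na\x{EF}ve \x{2603}";

# Write through an explicit UTF-8 layer into an in-memory buffer
# (a stand-in for STDOUT).
open( my $fh, '>:encoding(UTF-8)', \my $buffer )
    or die("Can't open in-memory handle: $!\n");
print {$fh} $decoded_text;
close($fh);

printf "%d characters became %d bytes\n",
    length($decoded_text), length($buffer);
```

For real output, the equivalent would be binmode(STDOUT, ':encoding(UTF-8)') before printing, in place of the use open line.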