Re^2: Problem displaying unicode for certain websites

Turns out the advice wouldn't have gotten you far.

You know what, it does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either.

That is, unless you set the decode_entities option to false. When you do, you have:

get returns decoded HTML (a string of characters, not bytes).
HTML::Strip->parse wants HTML encoded using an ASCII-derived encoding (UTF-8, for example).
HTML::Strip->parse returns text in that same encoding.
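
To make that character/byte round trip concrete, here's a minimal sketch that runs without the network and without HTML::Strip. The strip_tags_bytes function is a hypothetical stand-in I made up for a byte-oriented stripper (it is not part of any module), and strip_decoded_html is just an illustrative wrapper name. Entity decoding is left out so the sketch only needs core Perl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a byte-oriented tag stripper.
# It expects UTF-8 bytes and returns UTF-8 bytes.
sub strip_tags_bytes {
    my ($bytes) = @_;
    $bytes =~ s/<[^>]*>//g;
    return $bytes;
}

# Illustrative wrapper: takes decoded text, returns decoded text.
sub strip_decoded_html {
    my ($decoded_html) = @_;

    # Characters -> UTF-8 bytes, done on a copy.
    utf8::encode( my $utf8_html = $decoded_html );

    my $utf8_text = strip_tags_bytes($utf8_html);

    # UTF-8 bytes -> characters, again on a copy.
    utf8::decode( my $decoded_text = $utf8_text );

    return $decoded_text;
}

# U+00E9 and U+263A are written as escapes so the source stays ASCII.
my $text = strip_decoded_html("<p>caf\x{E9} <b>\x{263A}</b></p>");

binmode( STDOUT, ':encoding(UTF-8)' );
print "$text\n";
```

The point is only that the encode happens right before the byte-oriented call and the decode right after it, so the rest of the program keeps working with decoded text.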

So here's a workaround to use the doubly-buggy module:

use strict;
use warnings;

use open ':std', ':locale';

use LWP::Simple    qw( get );
use HTML::Strip    qw( );
use HTML::Entities qw( decode_entities );

@ARGV == 1
   or die("usage: $0 url\n");

my $url = $ARGV[0];

defined( my $decoded_html = get($url) )
   or die("Couldn't fetch $url\n");

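# Keep HTML::Strip from decoding entities itself; that's done below instead.
my $hs = HTML::Strip->new( decode_entities => 0 );

# parse() wants encoded bytes, so encode a copy as UTF-8 first.
utf8::encode( my $utf8_html = $decoded_html );
my $utf8_text = $hs->parse( $utf8_html );

# parse() returns UTF-8 bytes too, so decode a copy back into characters.
utf8::decode( my $decoded_text = $utf8_text );

# Only now is it safe to expand the entities.
$decoded_text = decode_entities($decoded_text);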

$decoded_text =~ s/^\s+//;
print substr($decoded_text, 0, 400);

I posted a similar program in response to the aforementioned bug report.
