Problem displaying unicode for certain websites

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I have the following code -

use strict;
use warnings;
use charnames qw(:full);
use LWP::Simple;
use HTML::Strip;
use Encode::Encoder qw(encoder);

binmode STDOUT, ":encoding(UTF-8)";

my $url = "http://www.svd.se/nyheter/utrikes/spricka-mellan-u-lander-oroar_3926455.svd";
#my $url = "http://www.expressen.se";
my $html = get($url);

defined $html or die "Can't fetch HTML from: ",$url;

my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $html );
print $clean_text;

I have set my console to display unicode but I get Mojibake for certain websites. The website that is commented out displays fine in my console window but not the other website. How can I get $html to store unicode for all websites?
Thanks for your help!

Re: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 10:09 UTC
In the previous thread of this discussion, I mentioned:

The real question is "is the input a string of bytes or a string of characters, and is the output a string of bytes or a string of characters". Try the various combinations.

So what did you find out?
Re^2: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 10:51 UTC
Turns out the advice wouldn't have gotten you far. You know what, HTML::Strip does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either. That is, unless you set the `decode_entities` option to false. When you do, you have:

`get` returns decoded HTML.
`HTML::Strip->parse` wants HTML encoded using an ASCII-derived encoding.
`HTML::Strip->parse` returns similarly encoded text.

So here's a workaround to use the doubly-buggy module:

use strict;
use warnings;
use open ':std', ':locale';

use LWP::Simple    qw( get );
use HTML::Strip    qw( );
use HTML::Entities qw( decode_entities );

my $url = $ARGV[0];

defined( my $decoded_html = get($url) )
    or die("Couldn't fetch $url\n");

my $hs = HTML::Strip->new( decode_entities => 0 );

utf8::encode( my $utf8_html = $decoded_html );
my $utf8_text = $hs->parse( $utf8_html );
utf8::decode( my $decoded_text = $utf8_text );

$decoded_text = decode_entities($decoded_text);

$decoded_text =~ s/^\s+//;
print substr($decoded_text, 0, 400);

I posted a similar program in response to the aforementioned bug report.
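The encode, process, decode pattern used in the workaround can be tried without a network connection. The following is only a minimal sketch: `strip_tags_bytes` is a hypothetical stand-in for a module that expects bytes in an ASCII-derived encoding (it is not HTML::Strip and its regex is far cruder), and `strip_tags_chars` is an illustrative wrapper name, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a byte-oriented module: it removes anything
# that looks like a tag and returns text in the same encoding it was given.
sub strip_tags_bytes {
    my ($html_bytes) = @_;
    (my $text_bytes = $html_bytes) =~ s/<[^>]*>//g;
    return $text_bytes;
}

# Illustrative wrapper: callers pass in characters and get characters back.
sub strip_tags_chars {
    my ($decoded_html) = @_;
    utf8::encode( my $utf8_html = $decoded_html );    # characters -> UTF-8 bytes
    my $utf8_text = strip_tags_bytes($utf8_html);     # byte-oriented processing
    utf8::decode( my $decoded_text = $utf8_text );    # UTF-8 bytes -> characters
    return $decoded_text;
}

# \x{E4} is "a with diaeresis", written as an escape so the source stays ASCII.
my $text = strip_tags_chars("<p>Spricka mellan u-l\x{E4}nder oroar</p>");
```

Keeping the conversion at the boundary of the byte-oriented call means the rest of the program only ever handles decoded text, which is what an encoding layer on STDOUT expects.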
Re^2: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 10:54 UTC
I'm a little confused. A Unicode string stores a set of bytes internally and these bytes represent a set of characters. One character might need a number of bytes within this internal representation. An ASCII string is the same idea except that only a single byte is needed to represent a character. But how do I know if a given variable stores a Unicode or ASCII string? Am I right in saying that if the get() function is given a Unicode string as argument that it will return a Unicode string? This wouldn't mean that my svd string is in ASCII and my expressen string is in Unicode and that doesn't make any sense to me. Please help!
Re^3: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 11:10 UTC
"But how do I know if a given variable stores a unicode or ascii string?"

It contains what you put in it. What did you put in it?

You have strings of (Unicode) characters and strings of bytes. If the string contains chr(0x2660), it's obviously not a string of bytes. If the string contains chr(0x41), it could be anything: ASCII 'A', the number 65, or something completely different.

If you pass a string with chr(0x41) in it to a function, you're not gonna get much information out of it. What you do is pass a string with something that can't be a byte in it. If it works, you know it's expecting characters.
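To make the distinction concrete, here is a small sketch using only the core Encode module. The sample string and the helper `has_non_byte` are made up for illustration; the helper simply checks whether any element of a string is too large to be a byte, which is the property the chr(0x2660) argument relies on.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# A string of characters: it contains U+2660, which cannot be a byte.
my $chars = "sp\x{E5}r \x{2660}";

# Encoding produces a string of bytes suitable for a file, socket or terminal.
my $bytes = encode('UTF-8', $chars);

# length() counts whatever the string holds: characters in one case,
# bytes in the other.
my $char_count = length $chars;
my $byte_count = length $bytes;

# Decoding goes the other way and recovers the original characters.
my $round_trip = decode('UTF-8', $bytes);

# Hypothetical helper: true if some element of the string is above 0xFF.
sub has_non_byte { return $_[0] =~ /[^\x00-\xFF]/ ? 1 : 0 }

printf "%d characters, %d bytes\n", $char_count, $byte_count;
printf "non-byte in chars: %d, non-byte in bytes: %d\n",
    has_non_byte($chars), has_non_byte($bytes);
```

A string for which the helper returns false tells you nothing either way, which is why a probe string should include something like chr(0x2660).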
Re^4: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:14 UTC
Re^5: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:25 UTC
Re^3: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 10:58 UTC
Ah, OK thanks!
Re^4: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:01 UTC