Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I have the following code -
    use strict;
    use warnings;
    use charnames qw(:full);
    use LWP::Simple;
    use HTML::Strip;
    use Encode::Encoder qw(encoder);

    binmode STDOUT, ":encoding(UTF-8)";

    my $url = "http://www.svd.se/nyheter/utrikes/spricka-mellan-u-lander-oroar_3926455.svd";
    #my $url = "http://www.expressen.se";

    my $html = get($url);
    defined $html or die "Can't fetch HTML from: ", $url;

    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $html );
    print $clean_text;
I have set my console to display unicode but I get Mojibake for certain websites. The website that is commented out displays fine in my console window but not the other website. How can I get $html to store unicode for all websites?
Thanks for your help!

Re: Problem displaying unicode for certain websites
by ikegami (Patriarch) on Dec 12, 2009 at 10:09 UTC

    In the previous thread of this discussion, I mentioned:

    The real question is "is the input a string of bytes or a string of characters, and is the output a string of bytes or a string of characters". Try the various combinations.

    So what did you find out?

      Turns out the advice wouldn't have gotten you far.

      You know what, it does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either.

      That is, unless you set the decode_entities option to false. When you do, you have:

      get returns decoded html.
      HTML::Strip->parse wants html encoded using an ASCII-derived encoding.
      HTML::Strip->parse returns similarly encoded text.

      So here's a workaround to use the doubly-buggy module:

          use strict;
          use warnings;
          use open ':std', ':locale';

          use LWP::Simple    qw( get );
          use HTML::Strip    qw( );
          use HTML::Entities qw( decode_entities );

          my $url = $ARGV[0];
          defined( my $decoded_html = get($url) )
              or die("Couldn't fetch $url\n");

          my $hs = HTML::Strip->new( decode_entities => 0 );

          utf8::encode( my $utf8_html = $decoded_html );
          my $utf8_text = $hs->parse( $utf8_html );
          utf8::decode( my $decoded_text = $utf8_text );

          $decoded_text = decode_entities($decoded_text);
          $decoded_text =~ s/^\s+//;
          print substr($decoded_text, 0, 400);

      I posted a similar program in response to the aforementioned bug report.
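
The encode/parse/decode sandwich in the workaround can be tried without a network connection or HTML::Strip. The following is only a minimal sketch: strip_tags_bytes is a hypothetical stand-in for a byte-oriented module (it is not HTML::Strip and does far less), and strip_tags_chars is an assumed wrapper name. It shows characters going in, UTF-8 bytes passing through the byte-only function, and characters coming back out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a byte-oriented module such as HTML::Strip:
# removes anything that looks like a tag and refuses non-byte input.
sub strip_tags_bytes {
    my ($bytes) = @_;
    die "expected bytes, got wide characters\n" if $bytes =~ /[^\x00-\xFF]/;
    ( my $text = $bytes ) =~ s/<[^>]*>//g;
    return $text;
}

# Wrapper: characters in, characters out.
sub strip_tags_chars {
    my ($decoded_html) = @_;
    utf8::encode( my $utf8_html = $decoded_html );    # characters -> UTF-8 bytes
    my $utf8_text = strip_tags_bytes($utf8_html);
    utf8::decode( my $decoded_text = $utf8_text );    # UTF-8 bytes -> characters
    return $decoded_text;
}

# Sample input written with escapes so the source file itself stays ASCII.
my $html = "<p>Sm\x{F6}rg\x{E5}s \x{2660}</p>";
my $text = strip_tags_chars($html);

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters: %s\n", length($text), $text;
```

Passing $html straight to strip_tags_bytes would die because of the U+2660 character, which is the same kind of mismatch the workaround avoids.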

      I'm a little confused. A Unicode string stores a set of bytes internally, and these bytes represent a set of characters. One character might need a number of bytes within this internal representation. An ASCII string is the same idea, except that only a single byte is needed to represent a character. But how do I know if a given variable stores a Unicode or ASCII string? Am I right in saying that if the get() function is given a Unicode string as argument, it will return a Unicode string? Wouldn't this mean that my svd string is in ASCII and my expressen string is in Unicode? That doesn't make any sense to me. Please help!

        But how do I know if a given variable stores a unicode or ascii string?

        It contains what you put in it. What did you put in it?


        You have strings of (Unicode) characters and strings of bytes.

        If the string contains chr(0x2660), it's obviously not a string of bytes. If the string contains chr(0x41), it could be anything. ASCII 'A', the number 65, or something completely different.
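
As a rough illustration of that distinction, a string can be tested for characters that cannot be bytes. This is only a sketch; could_be_bytes is a hypothetical helper name, and a true result says nothing about what the string actually holds, only that bytes are not ruled out.

```perl
use strict;
use warnings;

# Returns 1 if every character fits in a byte (0..255), so the string
# *might* be bytes; returns 0 if it can only be a string of characters.
sub could_be_bytes {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/ ? 1 : 0;
}

for my $s ( chr(0x2660), chr(0x41), chr(0xE9) ) {
    printf "U+%04X: %s\n", ord($s),
        could_be_bytes($s) ? "could be bytes or characters" : "characters only";
}
```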

        If you pass a string with chr(0x41) in it to a function, you're not gonna get much information out of it. What you do is pass a string with something that can't be a byte in it. If it works, you know it's expecting characters.
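
That probing approach might look like the sketch below. It uses the core Digest::MD5 module as an example of a function that wants bytes (md5_hex croaks on characters above 255); probe is a hypothetical helper, and it only detects functions that die. A function that quietly returns garbage for the wrong kind of input needs its output inspected instead.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper: call $func with a string that cannot be bytes
# and report whether the call survived.
sub probe {
    my ($func) = @_;
    my $ok = eval { $func->( chr(0x2660) ); 1 };
    return $ok ? 'accepts characters' : 'wants bytes';
}

print "md5_hex: ", probe( \&md5_hex ), "\n";
print "length:  ", probe( sub { length $_[0] } ), "\n";

# Encoding first gives md5_hex the bytes it wants.
utf8::encode( my $bytes = chr(0x2660) );
print "md5_hex of UTF-8 bytes: ", md5_hex($bytes), "\n";
```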

        Ah, OK thanks!