in reply to Problem displaying unicode for certain websites

In the previous thread of this discussion, I mentioned:

The real question is "is the input a string of bytes or a string of characters, and is the output a string of bytes or a string of characters". Try the various combinations.

So what did you find out?

Re^2: Problem displaying unicode for certain websites
by ikegami (Patriarch) on Dec 12, 2009 at 10:51 UTC

    Turns out the advice wouldn't have gotten you far.

    You know what, HTML::Strip does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either.

    That is, unless you set the decode_entities option to false. When you do, you have:

    get returns decoded html.
    HTML::Strip->parse wants html encoded using an ASCII-derived encoding.
    HTML::Strip->parse returns similarly encoded text.

    So here's a workaround to use the doubly-buggy module:

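    use strict;
    use warnings;

    # Encode/decode STDIN, STDOUT and STDERR according to the current locale.
    use open ':std', ':locale';

    use LWP::Simple    qw( get );
    use HTML::Strip    qw( );
    use HTML::Entities qw( decode_entities );

    my $url = $ARGV[0];

    # get returns decoded HTML (a string of characters).
    defined( my $decoded_html = get($url) )
        or die("Couldn't fetch $url\n");

    # Keep HTML::Strip from decoding entities itself.
    my $hs = HTML::Strip->new( decode_entities => 0 );

    # parse wants encoded HTML and returns similarly encoded text,
    # so encode on the way in and decode on the way out.
    utf8::encode( my $utf8_html = $decoded_html );
    my $utf8_text = $hs->parse( $utf8_html );
    utf8::decode( my $decoded_text = $utf8_text );

    # Decode the entities ourselves, now that we have characters again.
    $decoded_text = decode_entities($decoded_text);
    $decoded_text =~ s/^\s+//;

    print substr($decoded_text, 0, 400);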

    I posted a similar program in response to the aforementioned bug report.

Re^2: Problem displaying unicode for certain websites
by Anonymous Monk on Dec 12, 2009 at 10:54 UTC
    I'm a little confused. A Unicode string stores a set of bytes internally, and these bytes represent a set of characters. One character might need several bytes in this internal representation. An ASCII string is the same idea, except that only a single byte is needed to represent each character. But how do I know if a given variable stores a Unicode or ASCII string? Am I right in saying that if the get() function is given a Unicode string as its argument, it will return a Unicode string? This wouldn't mean that my svd string is in ASCII and my expressen string is in Unicode, and that doesn't make any sense to me. Please help!

      But how do I know if a given variable stores a unicode or ascii string?

      It contains what you put in it. What did you put in it?


      You have strings of (Unicode) characters and strings of bytes.

      If the string contains chr(0x2660), it's obviously not a string of bytes. If the string contains chr(0x41), it could be anything: ASCII 'A', the number 65, or something completely different.
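
      A minimal sketch of that distinction, using only built-ins. The variable names are made up for illustration. It checks whether a string holds anything that cannot be a byte, and shows what encoding does to such a character:

```perl
use strict;
use warnings;

# A character above 255 cannot be a byte, so this must be a string of characters.
my $chars        = chr(0x2660);    # U+2660 BLACK SPADE SUIT
my $has_non_byte = $chars =~ /[^\x00-\xFF]/ ? 1 : 0;

# chr(0x41) fits in a byte, so the string alone cannot tell you what it means.
my $ambiguous      = chr(0x41);
my $could_be_bytes = $ambiguous =~ /[^\x00-\xFF]/ ? 0 : 1;

# Encoding a copy turns the single character into its UTF-8 bytes.
utf8::encode( my $bytes = $chars );

printf "%d %d %d %d\n",
    $has_non_byte, $could_be_bytes, length($chars), length($bytes);
# prints: 1 1 1 3
```

      The check only proves a string is not bytes. A string whose characters all fit in a byte tells you nothing on its own, which is why you have to know what you put in it.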

      If you pass a string with chr(0x41) in it to a function, you're not gonna get much information out of it. What you do is pass a string with something that can't be a byte in it. If it works, you know it's expecting characters.
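
      Here is a rough sketch of that probing technique. The helper accepts_characters is hypothetical, and the two functions being probed are stand-ins chosen because both modules ship with Perl: md5_hex from Digest::MD5 works on bytes, and encode_utf8 from Encode works on characters.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use Encode      qw( encode_utf8 );

# Hypothetical helper: call $func with a string that cannot be bytes
# and report whether the call survived.
sub accepts_characters {
    my ($func) = @_;
    my $probe = "A" . chr(0x2660);    # contains a character above 255
    return eval { my $result = $func->($probe); 1 } ? 1 : 0;
}

my $md5_ok    = accepts_characters( \&md5_hex );      # croaks on wide characters
my $encode_ok = accepts_characters( \&encode_utf8 );  # expects characters

print "md5_hex: $md5_ok, encode_utf8: $encode_ok\n";
# prints: md5_hex: 0, encode_utf8: 1
```

      This only catches functions that reject the probe outright. A function that silently returns garbage still looks like it worked, so you also need to inspect what comes back.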

        Thanks ikegami! Your code gives me -
        Oj, f\x{00e5}r vi ingen mat?!
        instead of -
        Oj, får vi ingen mat?!
        How come?
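
        One plausible cause, assuming the script ran under a locale whose character set cannot represent å (for example C or POSIX): use open ':std', ':locale' then gives STDOUT an encoding that has no å, and the character is written as a Perl-style escape instead. The sketch below reproduces that kind of substitution with Encode directly. Using ASCII as the target encoding is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text = "Oj, f\x{e5}r vi ingen mat?!";    # \x{e5} is the character å

# ASCII has no å, so FB_PERLQQ substitutes a \x{...} escape for it.
my $ascii = encode( 'ascii', my $copy_for_ascii = $text, Encode::FB_PERLQQ );

# UTF-8 can represent every character, so nothing needs escaping.
my $utf8 = encode( 'UTF-8', my $copy_for_utf8 = $text );

print "$ascii\n";    # shows the same kind of escape as the output above
```

        If that is what is happening, running the script under a UTF-8 locale (or replacing ':locale' with ':encoding(UTF-8)' when the terminal expects UTF-8) should make the å come out as intended.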
      Ah, OK thanks!
        One last question. Should I be using alternatives to these buggy modules?