Re: Problem displaying unicode for certain websites

Re^2: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 10:51 UTC
Turns out the advice wouldn't have gotten you far. You know what, it does return garbage. And not just the "we can fudge it" kind you usually get from modules that predate Perl's support of Unicode, either. That is, unless you set the `decode_entities` option to false. When you do, you have:

- `get` returns decoded HTML.
- `HTML::Strip->parse` wants HTML encoded using an ASCII-derived encoding.
- `HTML::Strip->parse` returns similarly encoded text.

So here's a workaround to use the doubly-buggy module:

    use strict;
    use warnings;
    use open ':std', ':locale';   # encode STDOUT according to the locale

    use LWP::Simple    qw( get );
    use HTML::Strip    qw( );
    use HTML::Entities qw( decode_entities );

    my $url = $ARGV[0];
    defined( my $decoded_html = get($url) )
        or die("Couldn't fetch $url\n");

    # Leave entities alone; they are decoded after stripping.
    my $hs = HTML::Strip->new( decode_entities => 0 );

    utf8::encode( my $utf8_html = $decoded_html );    # characters -> UTF-8 bytes
    my $utf8_text = $hs->parse( $utf8_html );
    utf8::decode( my $decoded_text = $utf8_text );    # UTF-8 bytes -> characters

    $decoded_text = decode_entities($decoded_text);
    $decoded_text =~ s/^\s+//;
    print substr($decoded_text, 0, 400);

I posted a similar program in response to the aforementioned bug report.
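For anyone who wants to see the encode / strip / decode round trip from the workaround above without a network connection or HTML::Strip installed, here is a minimal offline sketch. `strip_tags_bytes` is a hypothetical stand-in I made up for a byte-oriented stripper; it is not HTML::Strip and is far cruder. The sample string is also an assumption, chosen to contain one character above 0x7F and one above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a byte-oriented module: it refuses anything
# that cannot be a byte, then crudely removes whatever looks like a tag.
sub strip_tags_bytes {
    my ($bytes) = @_;
    die "expected bytes, got wide characters\n" if $bytes =~ /[^\x00-\xFF]/;
    ( my $text = $bytes ) =~ s/<[^>]*>//g;
    return $text;
}

# Decoded (character) HTML, as in the workaround above.
my $decoded_html = "<p>Oj, f\x{E5}r vi ingen mat?! \x{2660}</p>";

utf8::encode( my $utf8_html = $decoded_html );    # characters -> UTF-8 bytes
my $utf8_text = strip_tags_bytes($utf8_html);     # bytes in, bytes out
utf8::decode( my $decoded_text = $utf8_text );    # UTF-8 bytes -> characters

# Only lengths are printed, so no output layer is needed for this demo.
printf "%d bytes came out of the stripper, %d characters after decoding\n",
    length($utf8_text), length($decoded_text);
```

The two lengths differ because the non-ASCII characters take more than one byte each while encoded. Passing `$decoded_html` straight to `strip_tags_bytes` would die on the spade, which is the situation the encode step avoids.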
Re^2: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 10:54 UTC
I'm a little confused. A Unicode string stores a set of bytes internally, and these bytes represent a set of characters. One character might need a number of bytes within this internal representation. An ASCII string is the same idea, except that only a single byte is needed to represent a character. But how do I know if a given variable stores a Unicode or ASCII string? Am I right in saying that if the get() function is given a Unicode string as argument, it will return a Unicode string? This would mean that my svd string is in ASCII and my expressen string is in Unicode, and that doesn't make any sense to me. Please help!
Re^3: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 11:10 UTC
> But how do I know if a given variable stores a unicode or ascii string?

It contains what you put in it. What did you put in it?

You have strings of (Unicode) characters and strings of bytes. If the string contains chr(0x2660), it's obviously not a string of bytes. If the string contains chr(0x41), it could be anything: ASCII 'A', the number 65, or something completely different.

If you pass a string with chr(0x41) in it to a function, you're not gonna get much information out of it. What you do is pass a string with something that can't be a byte in it. If it works, you know it's expecting characters.
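A rough sketch of that probing idea, under a few assumptions: `has_non_byte` and `accepts_characters` are hypothetical helpers written for this post, and the two probe targets are just convenient examples. `Digest::MD5::md5_hex` is documented to croak on wide characters, so it stands in for a bytes-only function; `uc` stands in for one that takes characters.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

my $spade  = chr(0x2660);    # above 255, so it cannot be a byte
my $letter = chr(0x41);      # could be a character or a byte; tells you nothing

# Hypothetical helper: true if the string holds something that can't be a byte.
sub has_non_byte { return $_[0] =~ /[^\x00-\xFF]/ ? 1 : 0 }

# Hypothetical helper: call the function with a non-byte character and
# report whether it survived. A die/croak means "this wants bytes".
sub accepts_characters {
    my ($func) = @_;
    my $ok = eval { $func->( chr(0x2660) ); 1 };
    return $ok ? 1 : 0;
}

my $md5_ok = accepts_characters( \&md5_hex );           # bytes-only function
my $uc_ok  = accepts_characters( sub { uc $_[0] } );    # character function

printf "spade has non-byte: %d, letter has non-byte: %d\n",
    has_non_byte($spade), has_non_byte($letter);
printf "md5_hex accepts characters: %d, uc accepts characters: %d\n",
    $md5_ok, $uc_ok;
```

Not dying is only half of "it works". A function can accept the string silently and still hand back garbage, so the returned value should be inspected as well.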
Re^4: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:14 UTC
Thanks ikegami! Your code gives me `Oj, f\x{00e5}r vi ingen mat?!` instead of `Oj, får vi ingen mat?!`. How come?
Re^5: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:25 UTC
Re^6: Problem displaying unicode for certain websites by ikegami (Patriarch) on Dec 12, 2009 at 19:04 UTC
Re^3: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 10:58 UTC
Ah, OK thanks!
Re^4: Problem displaying unicode for certain websites by Anonymous Monk on Dec 12, 2009 at 11:01 UTC
One last question. Should I be using alternatives to these buggy modules?