Unicode nightmare: is there a cheat sheet or solutions diagram?

rduke15 has asked for the wisdom of the Perl Monks concerning the following question:

Unicode has been torturing me for about a decade I guess.

I have read everything about it in perldoc and in various modules' man pages, several times, but it's just too much. Every time, I have to re-read the docs and try many different things before it somehow works for one specific problem. Next time, something is different, and the whole read-and-try cycle starts again. It drives me mad.

Isn't there a simple cheat sheet or diagram somewhere which would help me find the solution faster?

My current specific problem is:

HTML file in Latin1
Linux machine with Latin1 locale
Perl 5.8.8
HTML is parsed with HTML::TreeBuilder::XPath
text from $node->as_text() is printed to STDOUT

The output is UTF8!

(I didn't expect having a Unicode problem this time, since everything is Latin1)

Using the same script with the same input file on another machine with a UTF8 locale and Perl 5.10, the output is Latin1!! Probably just to annoy me, some gremlin goes to great lengths to ensure I get the wrong output...

The great solution which I dream of is a web form where I input my specifics (perl version, input source and encoding, wanted output destination and encoding), and get the answer of what I need to do (which module to load, when to decode or encode, etc.).

Re: Unicode nightmare: is there a cheat sheet or solutions diagram? by Jeffrey Kegler (Hermit) on Jun 10, 2010 at 00:12 UTC
In many things "hack until it looks OK" is fine. Unicode is one of the exceptions. Get the concepts down first. Juerd did us all a great service with perlunitut and perlunifaq. Here's a great cheat sheet (also from Juerd): http://juerd.nl/site.plp/perluniadvice.

While striving for Unicode enlightenment, meditate on this mantra: "Decode everything you receive, encode everything you send out. (If it's text data.)" It's not the totality of wisdom, but as its layers of meaning dawn, you will see it contains 90% of what you need to know. And it is a quote from ... what, did you guess Juerd? How'd you know?
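As a rough illustration of the mantra, here is a minimal sketch using the core Encode module. The sample string and the variable names are made up for the example; a real script would read its bytes from a file or socket instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a Latin-1 file: "café", with é as byte 0xE9.
# (Sample data invented for this sketch.)
my $bytes_in = "caf\xE9";

# Decode on the way in: bytes become a Perl character string.
my $text = decode('ISO-8859-1', $bytes_in);

# Encode on the way out, naming the target encoding explicitly.
my $latin1_out = encode('ISO-8859-1', $text);   # 4 bytes
my $utf8_out   = encode('UTF-8',      $text);   # 5 bytes: é becomes 0xC3 0xA9

binmode STDOUT;             # raw bytes, since the string is already encoded
print $utf8_out, "\n";
```

An alternative to calling encode() by hand is to put an `:encoding(...)` layer on the output handle with binmode and print the character string directly. Either way, the output encoding is chosen by the script rather than left to chance.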
Re^2: Unicode nightmare: is there a cheat sheet or solutions diagram? by moritz (Cardinal) on Jun 10, 2010 at 07:10 UTC
While striving for Unicode enlightenment, meditate on this mantra: "Decode everything you receive, encode everything you send out. (If it's text data.)" It's not the totality of wisdom, but as its layers of meaning dawn, you will see it contains 90% of what you need to know.

I agree, and from the description it sounds as if rduke15 is actually doing that. Since Latin-1 is the default encoding, not doing anything with the input should be the same as decoding as Latin-1 ... except that in older versions of Perl, it's sometimes not.

I'm also intrigued by the description of the locale having an influence on the result. If I understood the documentation correctly, locales shouldn't have an effect unless you explicitly `use locale`. Is that correct? Still, I have observed that it does have an effect, though I've never been able to put my finger on what exactly is going on when a non-UTF8 locale comes into play.
Re: Unicode nightmare: is there a cheat sheet or solutions diagram? by Anonymous Monk on Jun 09, 2010 at 11:37 UTC
Use `PerlIO::get_layers()` on STDOUT and check whether a utf8 layer is listed.
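A small sketch of that check, in case it helps. The helper name `has_utf8_layer` is made up for this example; `PerlIO::get_layers` itself is built into Perl and needs no `use` line.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle's output side has a "utf8" layer.
sub has_utf8_layer {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);
    return scalar grep { $_ eq 'utf8' } @layers;
}

# Report on STDERR so the diagnostic does not mix with the real output.
my @layers = PerlIO::get_layers(\*STDOUT, output => 1);
print STDERR "STDOUT layers: @layers\n";
print STDERR "utf8 layer present: ",
    (has_utf8_layer(\*STDOUT) ? "yes" : "no"), "\n";
```

Running this on both machines should show whether something (a command-line switch, an environment variable, a module) has put a layer on STDOUT on one of them but not the other.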
Re: Unicode nightmare: is there a cheat sheet or solutions diagram? by mirod (Canon) on Jun 10, 2010 at 13:44 UTC
If that's any comfort, you are probably not the only one to have been tortured by Unicode for that long.

I have a last-resort shell trick that goes `iconv -f utf8 -t utf8 <dodgy_file> || iconv -f iso-8859-1 -t utf8 <dodgy_file> > <utf8_file>` for those cases where I am not 100% sure of the encoding I get, and I don't have the time to debug it properly.

In your case I wonder whether the problem is that you have some non-Latin1 characters in your input (that got there through entities in the HTML maybe, something like `&mdash;` for example). My machine has a UTF-8 locale, so I can't reproduce your problem exactly, but the 2 one-liners below give me output in different encodings:

`perl -MHTML::TreeBuilder::XPath -e'$t= HTML::TreeBuilder::XPath->new; $t->parse( "<html><body><p>para &copy;</p></body></html>"); $p=$t->findvalue( "//p"); print $p, "\n";'`

That first one gives me a result in latin1.

`perl -MHTML::TreeBuilder::XPath -e'$t= HTML::TreeBuilder::XPath->new; $t->parse( "<html><body><p>para &copy; &mdash;</p></body></html>"); @p=$t->findnodes( "//p"); print $p[0]->as_text(), "\n";'`

I just added an `&mdash;`, an entity that cannot be printed in Latin1, to the input et voilà! Now the result is in UTF-8. In that case using HTML::Entities should help:

`perl -MHTML::TreeBuilder::XPath -MHTML::Entities -e' $t= HTML::TreeBuilder::XPath->new; $t->parse( "<html><body><p>para &copy; &mdash;</p></body></html>"); @p=$t->findnodes( "//p"); $out= $p[0]->as_text; encode_entities( $out); print $out, "\n";'`
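The same "try UTF-8, else assume Latin-1" fallback as the iconv trick can be sketched in Perl with the core Encode module. This is only a heuristic, not a real encoding detector, and the helper name `decode_guess` is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strict UTF-8 first, Latin-1 as the fallback.
sub decode_guess {
    my ($bytes) = @_;    # work on a copy; decode() with a CHECK flag may modify its argument
    my $copy = $bytes;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of substituting.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Not valid UTF-8: assume Latin-1, where every byte is a valid character.
    return decode('ISO-8859-1', $bytes);
}

# Both inputs decode to the same four characters.
my $from_utf8   = decode_guess("caf\xC3\xA9");
my $from_latin1 = decode_guess("caf\xE9");

binmode STDOUT;
print encode('UTF-8', $from_latin1), "\n";
```

Like the shell version, this will misread Latin-1 input that happens to form valid UTF-8 byte sequences, so it is best kept for the cases where the encoding really cannot be known in advance.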