Utf8 experts help!

chessgui has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Utf8 experts help! by moritz (Cardinal) on Jan 24, 2012 at 10:13 UTC
"My questions: how do I know that I have to apply decoding more than once"

You have to deduce that from the input yourself. There's no way perl could know it for you. However, if you have a sample of the input, the same sample correctly decoded, and an idea of which encodings are involved, my module Encode::Repair can help you.

"and why different utf8 decoders behave differently in perl?"

Encode::decode is a high-level API, and thus does extra sanity checks that the low-level API does not.

Perl 6 - second systems done right
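One rough way to make that deduction from the data is to keep decoding while the result still looks like UTF-8 octets. The following is only a heuristic sketch, not something perl or Encode offers out of the box: `peel_utf8_layers` and its `$max_layers` parameter are hypothetical names, and legitimate text made of Latin-1 characters that happen to form valid UTF-8 would be misdetected as an extra layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strip UTF-8 layers while the data still looks like
# UTF-8 octets. Returns the resulting string and the number of layers removed.
sub peel_utf8_layers {
    my ($octets, $max_layers) = @_;
    $max_layers //= 3;                      # arbitrary safety limit (assumption)
    my $current = $octets;
    my $layers  = 0;
    while ($layers < $max_layers) {
        # Characters above 0xFF cannot be octets, so nothing is left to decode.
        last if $current =~ /[^\x00-\xFF]/;
        # After the first pass, pure ASCII would just decode to itself.
        last if $layers > 0 && $current !~ /[\x80-\xFF]/;
        my $copy    = $current;             # decode with a CHECK value may modify its input
        my $decoded = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
        last unless defined $decoded;       # not valid UTF-8: stop here
        $current = $decoded;
        $layers++;
    }
    return ($current, $layers);
}

my ($single, $n_single) = peel_utf8_layers("\320\245");          # one layer
my ($double, $n_double) = peel_utf8_layers("\303\220\302\245");  # two layers
print "layers: $n_single and $n_double\n";
```

Treat the layer count as a hint to check against known-good samples, not as proof of how the data was encoded.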
Re: Utf8 experts help! by Anonymous Monk on Jan 24, 2012 at 10:18 UTC
It would be more helpful if you supplied a hex/oct dump of the data (Devel::Peek is okay) and the actual code that does not work for you, complete with accurate, not paraphrased, error messages so we don't have to speculate. Report the facts, not your observations about them.

I'll assume that you made your observation correctly and it is indeed double-encoded UTF-8, instead of the more likely misencoding occurring in the wild: Windows-1251 plus UTF-8. The character string `Гарри Каспаров` would be UTF-8 double-encoded to `"\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262"`. Assigning that to a variable and decoding twice works for me and gives no error message.

`use strictures; use Encode qw(); my $what = "\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262"; $what = Encode::decode('utf8',$what); $what = Encode::decode('utf8',$what);`

Your turn…
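To see where such a byte string comes from, here is a minimal sketch that produces it by encoding the same name twice and then undoes both layers. The variable names are illustrative only; the name is written with `\x{...}` escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The name from the post as a character string (14 characters).
my $name = "\x{413}\x{430}\x{440}\x{440}\x{438} "
         . "\x{41a}\x{430}\x{441}\x{43f}\x{430}\x{440}\x{43e}\x{432}";

# Encoding once gives the UTF-8 octets.
my $once  = encode('UTF-8', $name);
# Encoding those octets again treats every byte as a Latin-1 character.
my $twice = encode('UTF-8', $once);

printf "chars: %d, once: %d bytes, twice: %d bytes\n",
    length $name, length $once, length $twice;

# Undo both layers. FB_CROAK makes decode die on malformed input,
# and LEAVE_SRC keeps it from modifying the source variable.
my $check      = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $inner      = decode('UTF-8', $twice, $check);
my $round_trip = decode('UTF-8', $inner, $check);
print "round trip ok\n" if $round_trip eq $name;
```

This prints `chars: 14, once: 27 bytes, twice: 53 bytes`: each Cyrillic letter takes two bytes after the first encoding and four after the second. The second decode succeeds only because the first one yields characters that all fit into a single byte.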
Re^2: Utf8 experts help! by chessgui (Scribe) on Jan 24, 2012 at 12:37 UTC
Your comment made me do a deeper analysis of my program, and it turned out that the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data (wide characters). This does not seem to bother utf8::decode, but hangs Encode::decode in the second round (why not in the first round???).

Decoded twice with utf8::decode:

    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        utf8::decode($what);$what_decoded_once=$what;
        #$what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        utf8::decode($what);$what_decoded_twice=$what;
        #$what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

Output:

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256fa7c) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x12d45bc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256fb14) at 0xe424cc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab78c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    ---------------------
    DECODED TWICE:
    ---------------------
    SV = PV(0x256fae4) at 0xe4251c
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab9cc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

Decoded twice with Encode::decode('utf8':

    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        #utf8::decode($what);$what_decoded_once=$what;
        $what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        #utf8::decode($what);$what_decoded_twice=$what;
        $what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

Output:

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256f0cc) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2620f0c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256f0b4) at 0xe424bc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2771b74 "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28
    Tk::Error: Cannot decode string with wide characters at C:/strawberry/perl/lib/Encode.pm line 174.
     Encode::decode at C:/strawberry/perl/lib/Encode.pm line 174
     Library::my_decode at Library.pm line 321
     GameBrowser::analyze_games_index_inner at GameBrowser.pm line 981
     GameBrowser::__ANON__ at GameBrowser.pm line 828
     GameBrowser::for_all_players at GameBrowser.pm line 813
     GameBrowser::analyze_games_index at GameBrowser.pm line 831
     GameBrowser::load_games_index at GameBrowser.pm line 1041
     Tk callback for .toplevel.frame.frame.button
     Tk::__ANON__ at C:/strawberry/perl/site/lib/Tk.pm line 251
     Tk::Button::butUp at C:/strawberry/perl/site/lib/Tk/Button.pm line 175
     <ButtonRelease-1>
     (command bound to event)
    Terminating on signal SIGHUP(1)
Re^3: Utf8 experts help! by Anonymous Monk on Jan 24, 2012 at 13:53 UTC
Now that you have posted code, I see your problem. You should have posted your real code and data right at the beginning; remember this for next time!

The problem lies in a wrong assumption. Somehow you got the idea that you have double-encoded UTF-8, but the octet sequence `\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275` which you have read from a file or the network is actually singly encoded UTF-8. Trying to decode it twice in a row is a programmer mistake. `decode 'UTF-8', "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"` returns the Perl character string `Хачатур Сугян`. Operate on that result.

Read http://p3rl.org/UNI in full to learn about the topic of encoding in Perl. Forget that the pragma `utf8` also has some functions; you must use the Encode module only.
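As a minimal sketch of that advice, assuming the octets arrive exactly as shown in the dump: decode once where the bytes enter the program, work on the character string, and encode once where the data leaves. The variable names and the `uc` step are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The 25 octets from the Devel::Peek dump.
my $octets = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
           . "\320\241\321\203\320\263\321\217\320\275";

# Decode exactly once. FB_CROAK dies on malformed input instead of
# silently substituting characters; LEAVE_SRC keeps $octets intact.
my $name = decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "%d octets became %d characters\n", length $octets, length $name;

# Character operations now see letters, not bytes.
my $upper = uc $name;

# Encode once when the data leaves the program.
my $out = encode('UTF-8', $upper);
```

A second `decode` on `$name` would die with "Cannot decode string with wide characters", because `$name` already holds characters above 0xFF.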
Re^3: Utf8 experts help! by choroba (Cardinal) on Jan 24, 2012 at 13:23 UTC
"the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data"

There is probably an error in the code that saves the processed data. Do you set any encoding on the output file?
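For illustration, one common way to set an encoding on a file is an `:encoding(UTF-8)` layer on both the writing and the reading handle, so that the program itself only ever sees character strings. This is a sketch under the assumption that the processed data is saved as plain lines of text; the temporary file and variable names are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A character string (the first name from the sample data, as code points).
my $name = "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440}";

# Temporary file used only for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The layer encodes characters to UTF-8 octets on output...
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} "$name\n";
close $out or die "close: $!";

# ...and decodes octets back to characters on input, so no manual
# decode call is needed on what is read.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

print "round trip ", ($line eq $name ? "ok" : "FAILED"), "\n";
```

With the layer in place, calling a decoder on `$line` again would be the same double-decoding mistake discussed in this thread.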
Re^4: Utf8 experts help! by chessgui (Scribe) on Jan 24, 2012 at 14:06 UTC
Re^5: Utf8 experts help! by chessgui (Scribe) on Jan 24, 2012 at 14:29 UTC
Re^5: Utf8 experts help! by ikegami (Patriarch) on Jan 24, 2012 at 21:15 UTC
Re^3: Utf8 experts help! by ikegami (Patriarch) on Jan 24, 2012 at 21:12 UTC
utf8::decode DID return an error. You just ignored it (didn't check for it).

    use Encode qw( decode );

    print("Using Encode::decode:\n");
    eval {
       $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
          . "\320\241\321\203\320\263\321\217\320\275";
       $s = decode("UTF-8", $s);
       $s = decode("UTF-8", $s);
       1
    } or print($@);

    print("\n");

    print("Using utf8::decode\n");
    eval {
       $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
          . "\320\241\321\203\320\263\321\217\320\275";
       utf8::decode($s) or die("Not valid UTF-8");
       utf8::decode($s) or die("Not valid UTF-8");  # line 19
       1
    } or print($@);

Output:

    Using Encode::decode:
    Cannot decode string with wide characters at .../Encode.pm line 174.

    Using utf8::decode
    Not valid UTF-8 at a.pl line 19.
Re: Utf8 experts help! by Anonymous Monk on Jan 24, 2012 at 09:40 UTC
It most likely isn't utf8; it is probably utf16~ or utf32~, so decoding it as utf8 is an error. Examine the bytes with hexdump or od -tacx1, or try Encode::Guess.
Re^2: Utf8 experts help! by chessgui (Scribe) on Jan 24, 2012 at 10:05 UTC
This code does the same as a single utf8::decode: `use Encode qw(decode); use Encode::Guess; sub my_decode { my $what=shift; #utf8::decode($what);utf8::decode($what); my $enc = guess_encoding($what); if(ref($enc)) { $what = decode($enc->name, $what); } return $what; }`
Re^2: Utf8 experts help! by chessgui (Scribe) on Jan 24, 2012 at 09:46 UTC
The response header (Content-Type: text/html; charset=utf-8) and the HTML code itself (<meta charset="utf-8" />) say explicitly that it is utf8. Why would utf8::decode work when applied twice if it were not utf8?