chessgui has asked for the wisdom of the Perl Monks concerning the following question:

When downloading chess games with international players I have to decode UTF-8 encoded names. This works as long as the names contain only simple accented characters (acute, umlaut, etc.). In that case both "utf8::decode($what);" and "$what=Encode::decode('utf8',$what);" do the job. With Russian players, however, where the name contains Cyrillic characters, the names seem to be encoded in a more complicated way: UTF-8 decoding yields a string which itself seems to be UTF-8 encoded. This gave me the idea to apply decoding twice. However, this only works with "utf8::decode($what);utf8::decode($what);" (and the result is what I expect: a readable string) but fails with "$what=Encode::decode('utf8',$what);$what=Encode::decode('utf8',$what);" (the second call reports an error saying the string is not a UTF-8 string).

My questions: how do I know that I have to apply decoding more than once and why different utf8 decoders behave differently in perl?

Re: Utf8 experts help!
by moritz (Cardinal) on Jan 24, 2012 at 10:13 UTC
    My questions: how do I know that I have to apply decoding more than once

    You have to deduce that from the input yourself. There's no way perl could know it for you.

    However, if you have a sample of the input, the same sample correctly decoded, and an idea what encodings are involved, my module Encode::Repair can help you.
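    One way to make that deduction mechanical is a heuristic along the following lines. This is only a sketch using the core Encode module (it is not how Encode::Repair works); the helper name decode_utf8_layers is hypothetical, and it assumes the only possible damage is UTF-8 octets that were mistaken for Latin-1 and encoded a second time.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode UTF-8 octets once, then undo one extra layer
# only if the result still looks like UTF-8 octets mistaken for Latin-1.
# Dies if the input is not valid UTF-8 in the first place.
sub decode_utf8_layers {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $text  = decode('UTF-8', $octets, $check);

    # Characters above U+00FF cannot be leftover octets, and pure ASCII
    # needs no second pass.
    return $text if $text =~ /[^\x00-\xff]/ || $text !~ /[\x80-\xff]/;

    # Turn the characters back into octets and try a second strict decode.
    my $bytes  = encode('ISO-8859-1', $text);
    my $second = eval { decode('UTF-8', $bytes, $check) };
    return defined $second ? $second : $text;
}
```

    It can still guess wrong: text that genuinely consists of Latin-1 characters such as "Ã©" is indistinguishable from a double-encoded "é", so the decision ultimately rests on knowing the data.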

    and why different utf8 decoders behave differently in perl?

    Encode::decode is a high level API, and thus does extra sanity checks that the low-level API does not.

Re: Utf8 experts help!
by Anonymous Monk on Jan 24, 2012 at 10:18 UTC
    It would be more helpful if you supplied a hex/oct dump of the data (Devel::Peek is okay) and the actual code that does not work for you, complete with accurate, not paraphrased, error messages so we don't have to speculate. Report the facts, not your observations about them. I'll assume that you made your observation correctly and it is indeed double-encoded UTF-8, instead of the likelier misencoding occurring in the wild: Windows-1251 plus UTF-8.
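    A dump of the kind requested here can be produced with core modules alone. The following is a minimal sketch; hex_octets is a hypothetical helper name, and the sample string is simply the UTF-8 encoding of the Cyrillic letter Г, not data from the original post.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Hypothetical helper: render each byte of an octet string as two hex digits.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

my $sample = "\320\223";          # UTF-8 octets for U+0413 (Г)
print hex_octets($sample), "\n";  # prints "d0 93"
Dump($sample);                    # flags and buffer contents, written to STDERR
```

    The hex form shows what is actually on the wire or in the file, while Dump additionally shows whether Perl has the UTF8 flag set on the scalar.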

    The character string Гарри Каспаров would be UTF-8 double-encoded to "\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262". Assigning that to a variable and decoding twice works for me and gives no error message.

    use strictures;
    use Encode qw();
    my $what = "\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262";
    $what = Encode::decode('utf8',$what);
    $what = Encode::decode('utf8',$what);
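    For reference, here is a sketch of how such double-encoded octets typically come about and how the two layers are peeled off again. It assumes the second layer was produced by treating the UTF-8 octets as Latin-1 characters; only the first name (Гарри) is used to keep it short.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "\x{413}\x{430}\x{440}\x{440}\x{438}";   # Гарри as a character string

# Correct path: encode once to get UTF-8 octets.
my $once = encode('UTF-8', $name);

# Faulty path: the octets are mistaken for Latin-1 characters and encoded again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

printf "%d octets after one encode, %d after two\n",
    length($once), length($twice);   # 10 octets after one encode, 20 after two

# Undoing it: the first decode yields only characters below U+0100, so they
# can be turned back into octets and decoded a second time.
my $step1    = decode('UTF-8', $twice);
my $repaired = decode('UTF-8', encode('ISO-8859-1', $step1));
```

    The intermediate step is what makes a second decode possible at all: it only works while every character is below U+0100.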

    Your turn…

      Your comment made me do a deeper analysis of my program, and it turned out that the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data (wide characters). This does not seem to bother utf8::decode, but makes Encode::decode fail in the second round (why not in the first round???).

      decoded twice with utf8::decode
      sub my_decode {
          my $what=shift;
          my $comment=shift;
          my $what_old=$what;
          open STDERR,'>>decode_dump.txt';
          print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
          Dump $what;
          utf8::decode($what);$what_decoded_once=$what;
          #$what_decoded_once=Encode::decode('utf8',$what);
          print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
          Dump $what_decoded_once;
          utf8::decode($what);$what_decoded_twice=$what;
          #$what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
          print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
          Dump $what_decoded_twice;
          exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
          return $what_decoded_twice;
      }
      ########################################################
      output:
      ########################################################
      ---------------------
      player white name
      ---------------------
      ORIGINAL:
      ---------------------
      SV = PV(0x256fa7c) at 0xe3a62c
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0x12d45bc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
        CUR = 25
        LEN = 28
      ---------------------
      DECODED ONCE:
      ---------------------
      SV = PV(0x256fb14) at 0xe424cc
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x25ab78c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
        CUR = 25
        LEN = 28
      ---------------------
      DECODED TWICE:
      ---------------------
      SV = PV(0x256fae4) at 0xe4251c
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x25ab9cc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
        CUR = 25
        LEN = 28
      decoded twice with Encode::decode('utf8'
      sub my_decode {
          my $what=shift;
          my $comment=shift;
          my $what_old=$what;
          open STDERR,'>>decode_dump.txt';
          print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
          Dump $what;
          #utf8::decode($what);$what_decoded_once=$what;
          $what_decoded_once=Encode::decode('utf8',$what);
          print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
          Dump $what_decoded_once;
          #utf8::decode($what);$what_decoded_twice=$what;
          $what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
          print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
          Dump $what_decoded_twice;
          exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
          return $what_decoded_twice;
      }
      ########################################################
      output:
      ########################################################
      ---------------------
      player white name
      ---------------------
      ORIGINAL:
      ---------------------
      SV = PV(0x256f0cc) at 0xe3a62c
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0x2620f0c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
        CUR = 25
        LEN = 28
      ---------------------
      DECODED ONCE:
      ---------------------
      SV = PV(0x256f0b4) at 0xe424bc
        REFCNT = 1
        FLAGS = (POK,pPOK,UTF8)
        PV = 0x2771b74 "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
        CUR = 25
        LEN = 28
      Tk::Error: Cannot decode string with wide characters at C:/strawberry/perl/lib/Encode.pm line 174.
       Encode::decode at C:/strawberry/perl/lib/Encode.pm line 174
       Library::my_decode at Library.pm line 321
       GameBrowser::analyze_games_index_inner at GameBrowser.pm line 981
       GameBrowser::__ANON__ at GameBrowser.pm line 828
       GameBrowser::for_all_players at GameBrowser.pm line 813
       GameBrowser::analyze_games_index at GameBrowser.pm line 831
       GameBrowser::load_games_index at GameBrowser.pm line 1041
       Tk callback for .toplevel.frame.frame.button
       Tk::__ANON__ at C:/strawberry/perl/site/lib/Tk.pm line 251
       Tk::Button::butUp at C:/strawberry/perl/site/lib/Tk/Button.pm line 175
       <ButtonRelease-1>
       (command bound to event)
      Terminating on signal SIGHUP(1)
        Now that you have posted code, I see your problem. You should have posted your real code and data right at the beginning; remember this for next time!

        The problem lies in a wrong assumption. Somehow you got the idea that you have double-encoded UTF-8, but the octet sequence \320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275 which you have read from a file or the network is actually singly encoded UTF-8. Trying to decode it twice in a row is a programming mistake.

        decode 'UTF-8', "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275" returns the Perl character string Хачатур Сугян. Operate on that result.

        Completely read http://p3rl.org/UNI to learn about the topic of encoding in Perl. Forget that the pragma utf8 also has some functions; you must use the Encode module only.
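        The single-decode advice can be checked against the posted octets. This is a small sketch using only the Encode module; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets exactly as shown in the ORIGINAL dump (CUR = 25).
my $octets = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
           . "\320\241\321\203\320\263\321\217\320\275";

# One decode turns the UTF-8 octets into a Perl character string.
my $name = decode('UTF-8', $octets);

printf "%d octets, %d characters\n",
    length($octets), length($name);   # 25 octets, 13 characters

# Character-level operations now behave as expected.
my @words        = split / /, $name;
my $all_cyrillic = $name =~ /\A[\p{Cyrillic} ]+\z/ ? 1 : 0;
```

        After this one step the string holds 13 characters, and splitting or matching by script works directly on it; no further decoding is needed.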

        the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data
        There is probably an error in the code that saves the processed data. Do you set any encoding on the output file?
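        Setting the encoding on both ends could look like the sketch below. It uses a temporary file purely for demonstration; the actual file name and the way the program stores its processed data are not known from the thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $name = "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440}";   # Хачатур

# Temporary file used only for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters to UTF-8 octets on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $name, "\n";
close $out;

# Reading: the same layer decodes on the way in, so no manual decode is needed.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
chomp(my $read_back = <$in>);
close $in;
```

        A string read through such a layer is already decoded; passing it to Encode::decode again is what raises "Cannot decode string with wide characters".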

        utf8::decode DID return an error. You just ignored it (didn't check for it).

        use Encode qw( decode );

        print("Using Encode::decode:\n");
        eval {
            $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
               . "\320\241\321\203\320\263\321\217\320\275";
            $s = decode("UTF-8", $s);
            $s = decode("UTF-8", $s);
            1
        } or print($@);

        print("\n");

        print("Using utf8::decode\n");
        eval {
            $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
               . "\320\241\321\203\320\263\321\217\320\275";
            utf8::decode($s) or die("Not valid UTF-8");
            utf8::decode($s) or die("Not valid UTF-8");  # line 19
            1
        } or print($@);
        Using Encode::decode:
        Cannot decode string with wide characters at .../Encode.pm line 174.

        Using utf8::decode
        Not valid UTF-8 at a.pl line 19.
Re: Utf8 experts help!
by Anonymous Monk on Jan 24, 2012 at 09:40 UTC

    It most likely isn't utf8, it is probably utf16~ or utf32~, so decoding it as utf8 is an error. Examine the bytes, hexdump/od -tacx1, Encode::Guess

      This code does the same as a single utf8::decode:

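      use Encode qw(decode);   # decode() is not exported by Encode::Guess
      use Encode::Guess;
      sub my_decode {
          my $what=shift;
          #utf8::decode($what);utf8::decode($what);
          my $enc = guess_encoding($what);
          if(ref($enc)) {
              # guess_encoding returns an object on success, an error string otherwise
              $what = decode($enc->name, $what);
          }
          return $what;
      }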
      The response header (Content-Type: text/html; charset=utf-8) and the HTML code itself (<meta charset="utf-8" />) say explicitly that it is utf8. Why would utf8::decode work applied twice if it were not utf8?