in reply to Utf8 experts help!

It would be more helpful if you supplied a hex/octal dump of the data (Devel::Peek is okay) and the actual code that does not work for you, complete with accurate, not paraphrased, error messages, so we don't have to speculate. Report the facts, not your observations about them. I'll assume that you made your observation correctly and that it is indeed double-encoded UTF-8, rather than the misencoding more likely to occur in the wild: Windows-1251 plus UTF-8.

The character string Гарри Каспаров would be UTF-8 double-encoded to "\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262". Assigning that to a variable and decoding twice works for me and gives no error message.

use strictures;
use Encode qw();

my $what = "\303\220\302\223\303\220\302\260\303\221\302\200\303\221\302\200\303\220\302\270 \303\220\302\232\303\220\302\260\303\221\302\201\303\220\302\277\303\220\302\260\303\221\302\200\303\220\302\276\303\220\302\262";
$what = Encode::decode('utf8', $what);
$what = Encode::decode('utf8', $what);
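For reference, here is a minimal sketch of how such double-encoded octets typically come about: correctly encoded UTF-8 is mistaken for ISO-8859-1 text and encoded a second time. The variable names are made up for illustration, and only the first word of the name is used to keep it short.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Гарри" written with escapes so the source file encoding does not matter.
my $name = "\x{413}\x{430}\x{440}\x{440}\x{438}";

# First (correct) encoding: characters to UTF-8 octets.
my $once = encode('UTF-8', $name);

# The mistake: treat those octets as ISO-8859-1 text and encode again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Decoding twice undoes exactly this kind of damage.
my $repaired = decode('UTF-8', decode('UTF-8', $twice));

print $repaired eq $name ? "repaired\n" : "still broken\n";
```

The octets in `$twice` are the same ones that start the octal string quoted above.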

Your turn…

Re^2: Utf8 experts help!
by chessgui (Scribe) on Jan 24, 2012 at 12:37 UTC
    Your comment made me do a deeper analysis of my program, and it turned out that the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data (wide characters). This does not seem to bother utf8::decode, but it makes Encode::decode fail in the second round (why not in the first round?).

    decoded twice with utf8::decode
    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        utf8::decode($what);$what_decoded_once=$what;
        #$what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        utf8::decode($what);$what_decoded_twice=$what;
        #$what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

    ########################################################
    output:
    ########################################################

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256fa7c) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x12d45bc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256fb14) at 0xe424cc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab78c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    ---------------------
    DECODED TWICE:
    ---------------------
    SV = PV(0x256fae4) at 0xe4251c
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab9cc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    decoded twice with Encode::decode('utf8'
    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        #utf8::decode($what);$what_decoded_once=$what;
        $what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        #utf8::decode($what);$what_decoded_twice=$what;
        $what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

    ########################################################
    output:
    ########################################################

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256f0cc) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2620f0c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256f0b4) at 0xe424bc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2771b74 "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    Tk::Error: Cannot decode string with wide characters at C:/strawberry/perl/lib/Encode.pm line 174.
     Encode::decode at C:/strawberry/perl/lib/Encode.pm line 174
     Library::my_decode at Library.pm line 321
     GameBrowser::analyze_games_index_inner at GameBrowser.pm line 981
     GameBrowser::__ANON__ at GameBrowser.pm line 828
     GameBrowser::for_all_players at GameBrowser.pm line 813
     GameBrowser::analyze_games_index at GameBrowser.pm line 831
     GameBrowser::load_games_index at GameBrowser.pm line 1041
     Tk callback for .toplevel.frame.frame.button
     Tk::__ANON__ at C:/strawberry/perl/site/lib/Tk.pm line 251
     Tk::Button::butUp at C:/strawberry/perl/site/lib/Tk/Button.pm line 175
     <ButtonRelease-1>
     (command bound to event)
    Terminating on signal SIGHUP(1)

      Now that you have posted code, I see your problem. You should have posted your real code and data right at the beginning; remember this for next time!

      The problem lies in a wrong assumption. Somehow you got the idea that you have double-encoded UTF-8, but the octet sequence \320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275, which you have read from a file or the network, is actually singly encoded UTF-8. Trying to decode it twice in a row is a programming mistake.

      decode 'UTF-8', "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275" returns the Perl character string Хачатур Сугян. Operate on that result.
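A minimal sketch of that single decode, assuming the octets arrive exactly as in the dump above. The strict check flags (`Encode::FB_CROAK` with `Encode::LEAVE_SRC`) are my addition, not part of the original advice; they make malformed input die instead of being silently replaced, and leave the source string untouched.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as read from the file (25 bytes, UTF8 flag off).
my $octets = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
           . "\320\241\321\203\320\263\321\217\320\275";

# Decode exactly once, at the point where the data enters the program.
my $name = decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "%d octets, %d characters\n", length $octets, length $name;
# prints "25 octets, 13 characters"
```

From here on, work with `$name` and never pass it to `decode` again.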

      Read http://p3rl.org/UNI in full to learn about the topic of encoding in Perl. Forget that the pragma utf8 also has some functions; you must use the Encode module only.

      the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data
      There is probably an error in the code that saves the processed data. Do you set any encoding on the output file?
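To illustrate what setting an encoding on the output file means, here is a minimal sketch: an encoding layer on the handle encodes on write and decodes on read, so the program only ever handles character strings. The temporary file and the variable names are placeholders, not taken from your program.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A character string (already decoded): "Хачатур".
my $name = "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440}";

# Placeholder file; in the real program this would be the data file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Encode on the way out ...
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $name, "\n";
close $out or die "Cannot close $path: $!";

# ... and decode on the way in, so no manual decode call is needed.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $read_back = <$in>;
close $in;
chomp $read_back;

print $read_back eq $name ? "round trip ok\n" : "mismatch\n";
```

With both layers in place, a routine like my_decode has nothing left to do on data read back from the file.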
        Files are written simply by print HANDLE "foo"; (though in some cases HANDLE is opened in binary mode instead of text mode). On the other hand, my program has grown quite large and has many interacting modules, so I cannot exclude the possibility that at some point my_decode is called on some fields before they are saved to a file.

        Still, I don't understand why Encode::decode('utf8' does not fail when reading wide characters in the first round, only in the second round.

      utf8::decode DID return an error. You just ignored it (didn't check for it).

      use Encode qw( decode );

      print("Using Encode::decode:\n");
      eval {
         $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
            . "\320\241\321\203\320\263\321\217\320\275";
         $s = decode("UTF-8", $s);
         $s = decode("UTF-8", $s);
         1
      } or print($@);

      print("\n");

      print("Using utf8::decode\n");
      eval {
         $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
            . "\320\241\321\203\320\263\321\217\320\275";
         utf8::decode($s) or die("Not valid UTF-8");
         utf8::decode($s) or die("Not valid UTF-8");  # line 19
         1
      } or print($@);

      Using Encode::decode:
      Cannot decode string with wide characters at .../Encode.pm line 174.

      Using utf8::decode
      Not valid UTF-8 at a.pl line 19.
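As for why only the second round fails: decoding expects octets, so every element of the input must fit in a byte. The raw file data passes that test; the result of the first decode contains code points above 0xFF, so it cannot be octets any more. A minimal sketch of that check, using a hypothetical helper `has_wide_chars` that is not part of Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the string holds any code point above 0xFF,
# in which case it cannot be a string of octets.
sub has_wide_chars {
    my ($s) = @_;
    return $s =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $octets = "\320\245\320\260";          # UTF-8 octets for two Cyrillic letters
my $chars  = decode('UTF-8', $octets);    # first round: succeeds

printf "octets wide? %d\n", has_wide_chars($octets);
printf "chars wide?  %d\n", has_wide_chars($chars);

# A second decode('UTF-8', $chars) would die with
# "Cannot decode string with wide characters".
```

Such a check is only a diagnostic aid for finding where a string gets decoded twice; the fix is still to decode once on input.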