
Your comment made me take a closer look at my program. It turns out that the error with Encode::decode('utf8', ...) occurs when it tries to decode a string, read from a file, that contains already processed data (wide characters). This does not seem to bother utf8::decode, but it makes Encode::decode fail in the second round (why not in the first round?).

Decoded twice with utf8::decode:

    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        utf8::decode($what);$what_decoded_once=$what;
        #$what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        utf8::decode($what);$what_decoded_twice=$what;
        #$what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

    ########################################################
    output:
    ########################################################

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256fa7c) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x12d45bc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256fb14) at 0xe424cc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab78c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    ---------------------
    DECODED TWICE:
    ---------------------
    SV = PV(0x256fae4) at 0xe4251c
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x25ab9cc "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

Decoded twice with Encode::decode('utf8', ...):

    sub my_decode {
        my $what=shift;
        my $comment=shift;
        my $what_old=$what;
        open STDERR,'>>decode_dump.txt';
        print STDERR "\n---------------------\n$comment\n---------------------\nORIGINAL:\n---------------------\n";
        Dump $what;
        #utf8::decode($what);$what_decoded_once=$what;
        $what_decoded_once=Encode::decode('utf8',$what);
        print STDERR "\n---------------------\nDECODED ONCE:\n---------------------\n";
        Dump $what_decoded_once;
        #utf8::decode($what);$what_decoded_twice=$what;
        $what_decoded_twice=Encode::decode('utf8',$what_decoded_once);
        print STDERR "\n---------------------\nDECODED TWICE:\n---------------------\n";
        Dump $what_decoded_twice;
        exit if( ($what_old ne $what_decoded_twice) && (!($what=~/[a-z]+/)) && ($comment eq 'player white name') );
        return $what_decoded_twice;
    }

    ########################################################
    output:
    ########################################################

    ---------------------
    player white name
    ---------------------
    ORIGINAL:
    ---------------------
    SV = PV(0x256f0cc) at 0xe3a62c
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2620f0c "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0
      CUR = 25
      LEN = 28

    ---------------------
    DECODED ONCE:
    ---------------------
    SV = PV(0x256f0b4) at 0xe424bc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x2771b74 "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275"\0 [UTF8 "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} \x{421}\x{443}\x{433}\x{44f}\x{43d}"]
      CUR = 25
      LEN = 28

    Tk::Error: Cannot decode string with wide characters at C:/strawberry/perl/lib/Encode.pm line 174.
     Encode::decode at C:/strawberry/perl/lib/Encode.pm line 174
     Library::my_decode at Library.pm line 321
     GameBrowser::analyze_games_index_inner at GameBrowser.pm line 981
     GameBrowser::__ANON__ at GameBrowser.pm line 828
     GameBrowser::for_all_players at GameBrowser.pm line 813
     GameBrowser::analyze_games_index at GameBrowser.pm line 831
     GameBrowser::load_games_index at GameBrowser.pm line 1041
     Tk callback for .toplevel.frame.frame.button
     Tk::__ANON__ at C:/strawberry/perl/site/lib/Tk.pm line 251
     Tk::Button::butUp at C:/strawberry/perl/site/lib/Tk/Button.pm line 175
     <ButtonRelease-1>
     (command bound to event)
    Terminating on signal SIGHUP(1)

Re^3: Utf8 experts help!
by Anonymous Monk on Jan 24, 2012 at 13:53 UTC
    Now that you have posted code, I can see your problem. You should have posted your real code and data right at the beginning; remember this for next time!

    The problem lies in a wrong assumption. Somehow you got the idea that you have double-encoded UTF-8, but the octet sequence \320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275, which you have read from a file or the network, is actually singly encoded UTF-8. Trying to decode it twice in a row is a programming error.

    decode 'UTF-8', "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275" returns the Perl character string Хачатур Сугян. Operate on that result.

    Read http://p3rl.org/UNI in full to learn about encodings in Perl. Forget that the utf8 pragma also has some functions; use only the Encode module.
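
The "decode once, then operate on the result" advice above could be sketched roughly as follows. This is only an illustration: the variable names are made up, and the FB_CROAK | LEAVE_SRC check flags are an assumption added here (they make malformed input die and leave the source string unmodified); they are not part of the original reply.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# The 25 octets from the dump: UTF-8 for a Cyrillic name.
my $octets = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
           . "\320\241\321\203\320\263\321\217\320\275";

# Decode exactly once, where the data enters the program.
# FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact.
my $name = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

# prints "25 octets in, 13 characters out"
printf "%d octets in, %d characters out\n", length($octets), length($name);

# Encode exactly once, where the data leaves the program.
my $out = encode('UTF-8', $name);
print "round trip ok\n" if $out eq $octets;
```

Everything between the single decode and the single encode then works on characters rather than octets.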

Re^3: Utf8 experts help!
by choroba (Cardinal) on Jan 24, 2012 at 13:23 UTC
    the error with Encode::decode('utf8' occurs when it attempts to decode a string read from a file which contains already processed data
    There is probably an error in the code that saves the processed data. Do you set any encoding on the output file?
      Files are written simply with print HANDLE "foo"; (although in some cases HANDLE is opened in binary mode instead of text mode). On the other hand, my program has grown quite large and has many interacting modules, so I cannot rule out that at some point my_decode is called on some fields before they are saved to a file.

      Still, I don't understand why Encode::decode('utf8', ...) fails on wide characters only in the second round and not in the first.
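
On the question of setting an encoding on the output file: one common approach, shown here only as a sketch, is to put an :encoding(UTF-8) layer on the handle so that print encodes character strings on the way out and readline decodes them on the way in. The file name and variable names below are hypothetical and not taken from the program discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical file name, for illustration only.
my $file = 'players_demo.txt';

# A character string (code points above 255), not UTF-8 octets.
my $name = "\x{425}\x{430}\x{447}\x{430}\x{442}\x{443}\x{440} "
         . "\x{421}\x{443}\x{433}\x{44f}\x{43d}";

# The layer encodes on output, so the file receives UTF-8 octets.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} "$name\n";
close($out) or die "Cannot close $file: $!";

# The matching layer decodes on input; no manual decode call is needed.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!";
chomp(my $read_back = <$in>);
close($in);

print "round trip ok\n" if $read_back eq $name;
unlink $file;
```

With layers on both ends, a separate decoding sub is never called on data that has already been decoded.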
        In light of Anonymous Monk's comment, I don't understand how a singly decoded sequence is generated in my program, since decoding takes place only in the my_decode sub (either twice or not at all).

        Still, I don't understand why Encode::decode('utf8', ...) fails on wide characters only in the second round and not in the first.

        There were no wide characters in the first round of decoding. You have the string "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 \320\241\321\203\320\263\321\217\320\275", and it only contains characters with values in [0..255]. A wide character is a character with a value greater than 255.
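
The definition above can be checked directly. The sketch below uses a hypothetical helper, has_wide_chars (not part of Encode or the utf8 pragma), which tests whether any character in a string has a value above 255.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: true if any character has a value above 255.
sub has_wide_chars {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $octets = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
           . "\320\241\321\203\320\263\321\217\320\275";
my $decoded = decode('UTF-8', $octets);

# The raw octets are all in [0..255]; the decoded string is not.
printf "before decoding: %s\n", has_wide_chars($octets)  ? 'wide' : 'no wide';
printf "after decoding:  %s\n", has_wide_chars($decoded) ? 'wide' : 'no wide';
```

This is why the first call to Encode::decode succeeds and the second one, which is handed the already decoded string, fails.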

Re^3: Utf8 experts help!
by ikegami (Patriarch) on Jan 24, 2012 at 21:12 UTC

    utf8::decode DID report an error: it returned false. You just ignored it (didn't check for it).

        use Encode qw( decode );

        print("Using Encode::decode:\n");
        eval {
            $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
               . "\320\241\321\203\320\263\321\217\320\275";
            $s = decode("UTF-8", $s);
            $s = decode("UTF-8", $s);
            1
        } or print($@);

        print("\n");

        print("Using utf8::decode\n");
        eval {
            $s = "\320\245\320\260\321\207\320\260\321\202\321\203\321\200 "
               . "\320\241\321\203\320\263\321\217\320\275";
            utf8::decode($s) or die("Not valid UTF-8");
            utf8::decode($s) or die("Not valid UTF-8");  # line 19
            1
        } or print($@);

    Output:

        Using Encode::decode:
        Cannot decode string with wide characters at .../Encode.pm line 174.

        Using utf8::decode
        Not valid UTF-8 at a.pl line 19.