http://qs1969.pair.com?node_id=830439

brycen has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, Unicoding is driving me batty today. I'm trying to understand this test output:
Ein <#c3><#96>konomisches Modell
Ein <#d6>konomisches Modell 1
Ein <#d6>konomisches Modell 1

Ein <#c3><#83><#c2><#96>konomisches Modell
Ein <#c3><#96>konomisches Modell
Ein <#c3><#96>konomisches Modell
Strings that are supposedly in perl's internal utf8 show <#d6> in some cases and <#c3><#96> in others. This matters because we use Storable::nfreeze() to store the data, which just copies Perl's internal format. What I expected from the script was the two byte sequence in all cases.
#!/usr/bin/perl -w
no utf8;
use Encode;

my ($ustring1, $ustring2, $ustring3);

$ustring1 = "Ein Ökonomisches Modell";

$ustring2 = "Ein \326konomisches Modell";
utf8::upgrade($ustring2);

$ustring3 = "Ein \326konomisches Modell";
$ustring3 = decode('latin1',$ustring3, 1);

print "\n";
print print_chrcodes($ustring1)." ".utf8::is_utf8($ustring1)."\n";
print print_chrcodes($ustring2)." ".utf8::is_utf8($ustring2)."\n";
print print_chrcodes($ustring3)." ".utf8::is_utf8($ustring3)."\n";
print "\n";
print print_chrcodes(encode('utf-8-strict',$ustring1))."\n";
print print_chrcodes(encode('utf-8-strict',$ustring2))."\n";
print print_chrcodes(encode('utf-8-strict',$ustring3))."\n";

sub print_chrcodes {
    my $str = shift;
    my $ret;
    foreach my $ascval (unpack("C*", $str)) {
        if($ascval == 13) { $ret .= '<cr>'; next; }
        if($ascval == 10) { $ret .= '<nl>'; next; }
        if($ascval < 128) { $ret .= chr($ascval); next; }
        $ret .= sprintf ("<#%x>",$ascval) if($ascval >= 128);
    }
    return $ret;
}
How can I force the internal perl representation to be two-byte utf-8, so that Storable::nfreeze() output approximates utf-8-strict? Keywords: Unicode, utf-8, utf-8-strict, Perl 5.10, Storable.

Replies are listed 'Best First'.
Re: Why does perl's internal utf8 seem to allow single-byte latin1?
by ikegami (Patriarch) on Mar 24, 2010 at 03:30 UTC

    What I expected from the script was the two byte sequence in all cases.

    Your expectations are wrong for ustring1. There's nothing that caused it to be changed to the less efficient storage format.

    utf8::is_utf8 pointed this out, and pointed out your expectations were accurate for ustring2 and ustring3.

    print_chrcodes doesn't look at the internal format. It looks at the content of the string. That's why it didn't tell you anything.

    ( The previous paragraph is wrong if you happen to use the buggy version of Perl the OP is using. I didn't notice the OP had included the output of this program. With 5.10, you get

    Ein <#d6>konomisches Modell
    Ein <#d6>konomisches Modell 1
    Ein <#d6>konomisches Modell 1

    Ein <#c3><#96>konomisches Modell
    Ein <#c3><#96>konomisches Modell
    Ein <#c3><#96>konomisches Modell
    )

    How can I force the internal perl representation to be two-byte utf-8

    utf8::upgrade and utf8::downgrade are used to switch between the two internal formats.

    use Devel::Peek qw( Dump );

    my $s1 = "Ein Ökonomisches Modell";
    my $s2 = "Ein \326konomisches Modell";

    Dump($s1);
    Dump($s2);

    utf8::upgrade( $s1 );
    utf8::upgrade( $s2 );

    Dump($s1);
    Dump($s2);

    SV = PV(0x2369cc) at 0x182a354
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x23fcc4 "Ein \326konomisches Modell"\0
      CUR = 23
      LEN = 24
    SV = PV(0x2369dc) at 0x182a384
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x23fd9c "Ein \326konomisches Modell"\0
      CUR = 23
      LEN = 24
    SV = PV(0x2369cc) at 0x182a354
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x182430c "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
      CUR = 24
      LEN = 25
    SV = PV(0x2369dc) at 0x182a384
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1832744 "Ein \303\226konomisches Modell"\0 [UTF8 "Ein \x{d6}konomisches Modell"]
      CUR = 24
      LEN = 25
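
The dumps above show the storage buffer changing under utf8::upgrade. As a complement, here is a minimal sketch (not part of the original reply) suggesting that the string's content, as seen by ordinary string operations, should stay the same. The variable names $native and $upgraded are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical demo string: one character, U+00D6, stored as a single byte.
my $native   = "\326";
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# String content is identical, whatever the storage format.
print "equal\n" if $native eq $upgraded;
print "length: ", length($native), " / ", length($upgraded), "\n";

# Only the internal flag (and the buffer behind it) differs.
printf "is_utf8: %d / %d\n",
    utf8::is_utf8($native)   ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0;

# Character-level code points, independent of the internal format.
my @native_codes   = map { ord } split //, $native;
my @upgraded_codes = map { ord } split //, $upgraded;
print "codes: @native_codes / @upgraded_codes\n";
```

If the two strings compare equal here, any code that depends on which internal format is in use is relying on an implementation detail rather than on the string's value.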

    All that being said, I have no idea what you are trying to accomplish. Sounds very very wrong.

      The given output was from perl v5.10.0, from Debian stable. I see the flaw in print_chrcodes, and have added 'use bytes' to get it to display the true internal format (matching Devel::Peek). Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.
        Oh, your source file is encoded using UTF-8 despite the no utf8;.

        Using "export PERL_UNICODE=SAD" or "export PERL_UNICODE=IE" switches the behavior of the script.

        No, they don't. You don't use @ARGV, you don't use STD* for anything but 7-bit chars, and you don't open any file handles. They have no effect whatsoever.

        Again, what are you trying to do? Whatever it is, you seem to be taking the worst possible approach.

Re: Why does perl's internal utf8 seem to allow single-byte latin1?
by repellent (Priest) on Mar 24, 2010 at 06:48 UTC
    Please read this first.

      How can I force the internal perl representation to be two-byte utf-8, ... ?

    Encode the string of characters.
    use Devel::Peek;

    my $str = "\326";
    Dump $str; # not UTF-8 (1 char is 1 byte)

    my $utf8 = $str;
    utf8::upgrade($utf8);
    Dump $utf8; # upgraded to UTF-8 (1 char is 2 bytes)

    my $utf8_as_bytes = $str;
    utf8::encode($utf8_as_bytes);
    Dump $utf8_as_bytes; # not UTF-8 (2 chars with 1 byte each)

    __END__
    SV = PV(0x8519228) at 0x8061e38
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x852b250 "\326"\0
      CUR = 1
      LEN = 4
    SV = PV(0x84556a0) at 0x8528b40
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x852b290 "\303\226"\0 [UTF8 "\x{d6}"]
      CUR = 2
      LEN = 3
    SV = PV(0x84556a8) at 0x8528730
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x852b2a0 "\303\226"\0
      CUR = 2
      LEN = 3
Re: Why does perl's internal utf8 seem to allow single-byte latin1?
by JavaFan (Canon) on Mar 24, 2010 at 08:25 UTC
    What I expected from the script was the two byte sequence in all cases.
    What I don't understand is that you go to the trouble of writing a script that shows your problem (good!), but then fail to 1) show the output of the script, and 2) write down what you expected it to print.

    Not having people have to cut, paste and run your script greatly improves the chances of getting a useful answer.

      The output of the script is in the original post:
      Ein <#c3><#96>konomisches Modell
      Ein <#d6>konomisches Modell 1
      Ein <#d6>konomisches Modell 1
      I expected to see <#c3><#96> in all three cases. The actual confusion on the test script turns out to be a bug in the "print_chrcodes" function. I either should have used Devel::Peek, or ensured the function used "use bytes":
      sub print_chrcodes {
          my $str = shift;
          my $ret;
          use bytes; #<<<<<<<<<here
          foreach my $ascval (unpack("C*", $str)) {
              if($ascval == 13) { $ret .= '<cr>'; next; }
              if($ascval == 10) { $ret .= '<nl>'; next; }
              if($ascval < 128) { $ret .= chr($ascval); next; }
              $ret .= sprintf ("<#%x>",$ascval) if($ascval >= 128);
          }
          return $ret;
      }
      What I was diagnosing was a situation where Unicode did not cleanly go through Storable::nfreeze, a database, then Storable::thaw :
      $hash->{TITLE}="f\xc3\xa4ce=\xe2\x98\xbb";
      print "TITLE=$hash->{TITLE}\n";
      my $nfreeze = Storable::nfreeze($hash);
      my $obj = Storable::thaw($nfreeze);
      print '${^UNICODE}='.${^UNICODE}."\n";
      print "TITLE=$obj->{TITLE}\n";
      That eventually resolved itself. The setting of PERL_UNICODE=SAD vs. PERL_UNICODE=0 was masking the true problem, as it allowed certain sequences perl thought of as latin1 to be seen on the terminal as their equivalent utf-8 character (in my test case, black smiling face ☻).
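
For readers hitting the same round-trip problem, the following is a minimal sketch, not the code that was actually deployed, of one common way to keep the two worlds separate: decode bytes into characters once on input, let Storable carry the character string, and encode once on output. It assumes the input really is UTF-8, and the names $octets, $title, $copy and $out are made up for the illustration.

```perl
use strict;
use warnings;
use Encode   qw( decode encode );
use Storable qw( nfreeze thaw );

# The bytes from the test case above: UTF-8 for "face" with a-umlaut,
# an equals sign, and the black smiling face.
my $octets = "f\xc3\xa4ce=\xe2\x98\xbb";

# Decode once at the input boundary, so perl holds 6 characters, not 9 bytes.
my $title = decode('UTF-8', $octets);

# Freeze and thaw the character string (assumption: no database in between).
my $frozen = nfreeze({ TITLE => $title });
my $copy   = thaw($frozen);

print "round trip ok\n" if $copy->{TITLE} eq $title;

# Encode once at the output boundary (terminal, socket, database handle).
my $out = encode('UTF-8', $copy->{TITLE});
print "bytes match\n" if $out eq $octets;
```

With this split, the internal storage format never needs to be forced; only the explicit decode and encode steps decide which bytes leave the program.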

      So in the end, with use encoding "utf-8-strict"; added to the script, PERL_UNICODE=SAD, proper terminal settings, Apache Accept-Charset, and SET CLIENT-ENCODING UTF8, I'm round-tripping Unicode through the entire Browser->Apache->Perl->Database->Perl->Apache->Browser system.

      Thanks for the help.