in reply to Re: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right

This thread is yet another one of countless similar PerlMonks threads about Perl's Unicode support that immediately devolves into a discussion (debate) about how it all works, how it doesn't work, what the bugs are, what worked for him when he tried foo, what worked for her when she tried bar, and so on and so forth.

I didn't study the thread, but I read enough of it to make my head explode.

I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.

  • Comment on Re^2: Mugged by UTF8, this CANNOT be right

Replies are listed 'Best First'.
Re^3: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 27, 2011 at 03:28 UTC

    The reason is that the DBDs are buggy, but this wasn't mentioned clearly, causing confusion.

    I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.

    Decode inputs. Encode outputs. This will leave you only the bugs, and one can usually work around them using utf8::upgrade or utf8::downgrade.

    The catch is that there are LOTS of inputs and outputs to a program. It would be nice to be warned when one is missed. This would require languages to have different types for encoded and decoded strings. That flags I mentioned earlier would achieve this.

      I realize this is the conventional wisdom among the cognoscenti and what some of the Perl documentation tells you to do, but it flies in the face of both the reasonable expectations of many Perl programmers (tosh and I, for example) and a lot of existing documentation. How often do you see a Perl command-line script in a book or a code example in a perldoc page that decodes its input and encodes its output? Not very often.

      I've been trying to sort out an elusive problem I'm having with Unicode text queried from a Microsoft Office Access database. To this end, I wrote a trivial test script:

      #!perl # # Characters.pl use strict; use warnings; use open qw( :utf8 :std ); use DBI qw( data_string_desc ); use Encode qw( decode_utf8 ); use English qw( -no_match_vars ); my $mdb = 'Characters.mdb'; my $dsn = "dbi:ODBC:driver=Microsoft Access Driver (*.mdb);dbq=$mdb"; my $dbh = DBI->connect($dsn, '', '', { PrintError => 0, RaiseError => 1, }); my $query = q( SELECT Windows1252Encoding, UnicodeCodePoint, UnicodeCharacter, UnicodeUTF8Encoding, UnicodeName FROM Characters ORDER BY Windows1252Encoding ); my $sth = $dbh->prepare($query); $sth->execute(); local $OUTPUT_RECORD_SEPARATOR = "\n"; local $OUTPUT_FIELD_SEPARATOR = "\t"; while (my $row = $sth->fetchrow_hashref()) { $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); print $row->{Windows1252Encoding}, $row->{UnicodeCodePoint}, $row->{UnicodeCharacter}, $row->{UnicodeUTF8Encoding}, $row->{UnicodeName}, data_string_desc($row->{UnicodeCharacter}); } $dbh->disconnect(); exit 0;

      And here's the data in the Characters table of the Characters.mdb database. All of this data is in Text (255) columns in the Characters table.

      0x80	U+20AC	€	E2 82 AC	EURO SIGN
      0x82	U+201A	‚	E2 80 9A	SINGLE LOW-9 QUOTATION MARK
      0x83	U+0192	ƒ	C6 92	LATIN SMALL LETTER F WITH HOOK
      0x84	U+201E	„	E2 80 9E	DOUBLE LOW-9 QUOTATION MARK
      0x85	U+2026	…	E2 80 A6	HORIZONTAL ELLIPSIS
      0x86	U+2020	†	E2 80 A0	DAGGER
      0x87	U+2021	‡	E2 80 A1	DOUBLE DAGGER
      0x88	U+02C6	ˆ	CB 86	MODIFIER LETTER CIRCUMFLEX ACCENT
      0x89	U+2030	‰	E2 80 B0	PER MILLE SIGN
      0x8A	U+0160	Š	C5 A0	LATIN CAPITAL LETTER S WITH CARON
      0x8B	U+2039	‹	E2 80 B9	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
      0x8C	U+0152	Œ	C5 92	LATIN CAPITAL LIGATURE OE
      0x8E	U+017D	Ž	C5 BD	LATIN CAPITAL LETTER Z WITH CARON
      0x91	U+2018	‘	E2 80 98	LEFT SINGLE QUOTATION MARK
      0x92	U+2019	’	E2 80 99	RIGHT SINGLE QUOTATION MARK
      0x93	U+201C	“	E2 80 9C	LEFT DOUBLE QUOTATION MARK
      0x94	U+201D	”	E2 80 9D	RIGHT DOUBLE QUOTATION MARK
      0x95	U+2022	•	E2 80 A2	BULLET
      0x96	U+2013	–	E2 80 93	EN DASH
      0x97	U+2014	—	E2 80 94	EM DASH
      0x98	U+02DC	˜	CB 9C	SMALL TILDE
      0x99	U+2122	™	E2 84 A2	TRADE MARK SIGN
      0x9A	U+0161	š	C5 A1	LATIN SMALL LETTER S WITH CARON
      0x9B	U+203A	›	E2 80 BA	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
      0x9C	U+0153	œ	C5 93	LATIN SMALL LIGATURE OE
      0x9E	U+017E	ž	C5 BE	LATIN SMALL LETTER Z WITH CARON
      0x9F	U+0178	Ÿ	C5 B8	LATIN CAPITAL LETTER Y WITH DIAERESIS
      0xA1	U+00A1	¡	C2 A1	INVERTED EXCLAMATION MARK
      0xA2	U+00A2	¢	C2 A2	CENT SIGN
      0xA3	U+00A3	£	C2 A3	POUND SIGN
      0xA4	U+00A4	¤	C2 A4	CURRENCY SIGN
      0xA5	U+00A5	¥	C2 A5	YEN SIGN
      0xA6	U+00A6	¦	C2 A6	BROKEN BAR
      0xA7	U+00A7	§	C2 A7	SECTION SIGN
      0xA8	U+00A8	¨	C2 A8	DIAERESIS
      0xA9	U+00A9	©	C2 A9	COPYRIGHT SIGN
      0xAA	U+00AA	ª	C2 AA	FEMININE ORDINAL INDICATOR
      0xAB	U+00AB	«	C2 AB	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
      0xAC	U+00AC	¬	C2 AC	NOT SIGN
      0xAD	U+00AD	­	C2 AD	SOFT HYPHEN
      0xAE	U+00AE	®	C2 AE	REGISTERED SIGN
      0xAF	U+00AF	¯	C2 AF	MACRON
      0xB0	U+00B0	°	C2 B0	DEGREE SIGN
      0xB1	U+00B1	±	C2 B1	PLUS-MINUS SIGN
      0xB2	U+00B2	²	C2 B2	SUPERSCRIPT TWO
      0xB3	U+00B3	³	C2 B3	SUPERSCRIPT THREE
      0xB4	U+00B4	´	C2 B4	ACUTE ACCENT
      0xB5	U+00B5	µ	C2 B5	MICRO SIGN
      0xB6	U+00B6	¶	C2 B6	PILCROW SIGN
      0xB7	U+00B7	·	C2 B7	MIDDLE DOT
      0xB8	U+00B8	¸	C2 B8	CEDILLA
      0xB9	U+00B9	¹	C2 B9	SUPERSCRIPT ONE
      0xBA	U+00BA	º	C2 BA	MASCULINE ORDINAL INDICATOR
      0xBB	U+00BB	»	C2 BB	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
      0xBC	U+00BC	¼	C2 BC	VULGAR FRACTION ONE QUARTER
      0xBD	U+00BD	½	C2 BD	VULGAR FRACTION ONE HALF
      0xBE	U+00BE	¾	C2 BE	VULGAR FRACTION THREE QUARTERS
      0xBF	U+00BF	¿	C2 BF	INVERTED QUESTION MARK
      0xC0	U+00C0	À	C3 80	LATIN CAPITAL LETTER A WITH GRAVE
      0xC1	U+00C1	Á	C3 81	LATIN CAPITAL LETTER A WITH ACUTE
      0xC2	U+00C2	Â	C3 82	LATIN CAPITAL LETTER A WITH CIRCUMFLEX
      0xC3	U+00C3	Ã	C3 83	LATIN CAPITAL LETTER A WITH TILDE
      0xC4	U+00C4	Ä	C3 84	LATIN CAPITAL LETTER A WITH DIAERESIS
      0xC5	U+00C5	Å	C3 85	LATIN CAPITAL LETTER A WITH RING ABOVE
      0xC6	U+00C6	Æ	C3 86	LATIN CAPITAL LETTER AE
      0xC7	U+00C7	Ç	C3 87	LATIN CAPITAL LETTER C WITH CEDILLA
      0xC8	U+00C8	È	C3 88	LATIN CAPITAL LETTER E WITH GRAVE
      0xC9	U+00C9	É	C3 89	LATIN CAPITAL LETTER E WITH ACUTE
      0xCA	U+00CA	Ê	C3 8A	LATIN CAPITAL LETTER E WITH CIRCUMFLEX
      0xCB	U+00CB	Ë	C3 8B	LATIN CAPITAL LETTER E WITH DIAERESIS
      0xCC	U+00CC	Ì	C3 8C	LATIN CAPITAL LETTER I WITH GRAVE
      0xCD	U+00CD	Í	C3 8D	LATIN CAPITAL LETTER I WITH ACUTE
      0xCE	U+00CE	Î	C3 8E	LATIN CAPITAL LETTER I WITH CIRCUMFLEX
      0xCF	U+00CF	Ï	C3 8F	LATIN CAPITAL LETTER I WITH DIAERESIS
      0xD0	U+00D0	Ð	C3 90	LATIN CAPITAL LETTER ETH
      0xD1	U+00D1	Ñ	C3 91	LATIN CAPITAL LETTER N WITH TILDE
      0xD2	U+00D2	Ò	C3 92	LATIN CAPITAL LETTER O WITH GRAVE
      0xD3	U+00D3	Ó	C3 93	LATIN CAPITAL LETTER O WITH ACUTE
      0xD4	U+00D4	Ô	C3 94	LATIN CAPITAL LETTER O WITH CIRCUMFLEX
      0xD5	U+00D5	Õ	C3 95	LATIN CAPITAL LETTER O WITH TILDE
      0xD6	U+00D6	Ö	C3 96	LATIN CAPITAL LETTER O WITH DIAERESIS
      0xD7	U+00D7	×	C3 97	MULTIPLICATION SIGN
      0xD8	U+00D8	Ø	C3 98	LATIN CAPITAL LETTER O WITH STROKE
      0xD9	U+00D9	Ù	C3 99	LATIN CAPITAL LETTER U WITH GRAVE
      0xDA	U+00DA	Ú	C3 9A	LATIN CAPITAL LETTER U WITH ACUTE
      0xDB	U+00DB	Û	C3 9B	LATIN CAPITAL LETTER U WITH CIRCUMFLEX
      0xDC	U+00DC	Ü	C3 9C	LATIN CAPITAL LETTER U WITH DIAERESIS
      0xDD	U+00DD	Ý	C3 9D	LATIN CAPITAL LETTER Y WITH ACUTE
      0xDE	U+00DE	Þ	C3 9E	LATIN CAPITAL LETTER THORN
      0xDF	U+00DF	ß	C3 9F	LATIN SMALL LETTER SHARP S
      0xE0	U+00E0	à	C3 A0	LATIN SMALL LETTER A WITH GRAVE
      0xE1	U+00E1	á	C3 A1	LATIN SMALL LETTER A WITH ACUTE
      0xE2	U+00E2	â	C3 A2	LATIN SMALL LETTER A WITH CIRCUMFLEX
      0xE3	U+00E3	ã	C3 A3	LATIN SMALL LETTER A WITH TILDE
      0xE4	U+00E4	ä	C3 A4	LATIN SMALL LETTER A WITH DIAERESIS
      0xE5	U+00E5	å	C3 A5	LATIN SMALL LETTER A WITH RING ABOVE
      0xE6	U+00E6	æ	C3 A6	LATIN SMALL LETTER AE
      0xE7	U+00E7	ç	C3 A7	LATIN SMALL LETTER C WITH CEDILLA
      0xE8	U+00E8	è	C3 A8	LATIN SMALL LETTER E WITH GRAVE
      0xE9	U+00E9	é	C3 A9	LATIN SMALL LETTER E WITH ACUTE
      0xEA	U+00EA	ê	C3 AA	LATIN SMALL LETTER E WITH CIRCUMFLEX
      0xEB	U+00EB	ë	C3 AB	LATIN SMALL LETTER E WITH DIAERESIS
      0xEC	U+00EC	ì	C3 AC	LATIN SMALL LETTER I WITH GRAVE
      0xED	U+00ED	í	C3 AD	LATIN SMALL LETTER I WITH ACUTE
      0xEE	U+00EE	î	C3 AE	LATIN SMALL LETTER I WITH CIRCUMFLEX
      0xEF	U+00EF	ï	C3 AF	LATIN SMALL LETTER I WITH DIAERESIS
      0xF0	U+00F0	ð	C3 B0	LATIN SMALL LETTER ETH
      0xF1	U+00F1	ñ	C3 B1	LATIN SMALL LETTER N WITH TILDE
      0xF2	U+00F2	ò	C3 B2	LATIN SMALL LETTER O WITH GRAVE
      0xF3	U+00F3	ó	C3 B3	LATIN SMALL LETTER O WITH ACUTE
      0xF4	U+00F4	ô	C3 B4	LATIN SMALL LETTER O WITH CIRCUMFLEX
      0xF5	U+00F5	õ	C3 B5	LATIN SMALL LETTER O WITH TILDE
      0xF6	U+00F6	ö	C3 B6	LATIN SMALL LETTER O WITH DIAERESIS
      0xF7	U+00F7	÷	C3 B7	DIVISION SIGN
      0xF8	U+00F8	ø	C3 B8	LATIN SMALL LETTER O WITH STROKE
      0xF9	U+00F9	ù	C3 B9	LATIN SMALL LETTER U WITH GRAVE
      0xFA	U+00FA	ú	C3 BA	LATIN SMALL LETTER U WITH ACUTE
      0xFB	U+00FB	û	C3 BB	LATIN SMALL LETTER U WITH CIRCUMFLEX
      0xFC	U+00FC	ü	C3 BC	LATIN SMALL LETTER U WITH DIAERESIS
      0xFD	U+00FD	ý	C3 BD	LATIN SMALL LETTER Y WITH ACUTE
      0xFE	U+00FE	þ	C3 BE	LATIN SMALL LETTER THORN
      0xFF	U+00FF	ÿ	C3 BF	LATIN SMALL LETTER Y WITH DIAERESIS
      

      I get the exact same output whether I decode the input or not. The statement…

      $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter});

      …is a no-op. The string returned by the DBI utility function data_string_desc() is always 'UTF8 on, non-ASCII, 1 characters 2 [or 3] bytes', regardless of whether or not I decode the queried text.

      So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!) queried from a Text datatype column in a Microsoft Office Access database that stores all data in Unicode?

        On Windows (and UNIX if built for unicode) DBD::ODBC reads UCS2 encoded data from the MS Access database though the ODBC driver manager and MS Access ODBC Driver. DBD::ODBC then decodes that and reencodes it UTF-8 such that Perl sees the strings as unicode. As a result, I believe the decode has been done for you.

        So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!)

        Good, cause you must not decode strings of text. Decoding is the process of getting text from "binary data".

        The statement $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); is a no-op.

        ug. The Unicode bug :(

        use strict; use warnings; use Test::More tests => 4; use Encode qw( encode_utf8 decode_utf8 ); my $x1 = chr(0xE9); my $y1 = encode_utf8($x1); my $z = encode_utf8($y1); my $y2 = decode_utf8($z); my $x2 = decode_utf8($y2); is($y2, $y1); is($x2, $x1); # Fails isnt($x1, $y1); isnt($y1, $z); 1;
        1..4 ok 1 not ok 2 # Failed test at a.pl line 15. # got: 'é' # expected: 'é' ok 3 ok 4 # Looks like you failed 1 test of 4.

        This is better:

        use strict; use warnings; use Test::More tests => 4; my $x1 = chr(0xE9); utf8::encode( my $y1 = $x1 ); utf8::encode( my $z = $y1 ); utf8::decode( my $y2 = $z ); utf8::decode( my $x2 = $y2 ); is($y2, $y1); is($x2, $x1); isnt($x1, $y1); isnt($y1, $z); 1;
        1..4 ok 1 ok 2 ok 3 ok 4