in reply to Re^2: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right

The reason is that the DBDs are buggy, but this wasn't mentioned clearly, causing confusion.

I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.

Decode inputs. Encode outputs. This will leave you only the bugs, and one can usually work around them using utf8::upgrade or utf8::downgrade.

The catch is that there are LOTS of inputs and outputs to a program. It would be nice to be warned when one is missed. This would require languages to have different types for encoded and decoded strings. That flags I mentioned earlier would achieve this.

  • Comment on Re^3: Mugged by UTF8, this CANNOT be right

Replies are listed 'Best First'.
Re^4: Mugged by UTF8, this CANNOT be right
by Jim (Curate) on Jan 27, 2011 at 07:38 UTC

    I realize this is the conventional wisdom among the cognoscenti and what some of the Perl documentation tells you to do, but it flies in the face of both the reasonable expectations of many Perl programmers (tosh and I, for example) and a lot of existing documentation. How often do you see a Perl command-line script in a book or a code example in a perldoc page that decodes its input and encodes its output? Not very often.

    I've been trying to sort out an elusive problem I'm having with Unicode text queried from a Microsoft Office Access database. To this end, I wrote a trivial test script:

    #!perl # # Characters.pl use strict; use warnings; use open qw( :utf8 :std ); use DBI qw( data_string_desc ); use Encode qw( decode_utf8 ); use English qw( -no_match_vars ); my $mdb = 'Characters.mdb'; my $dsn = "dbi:ODBC:driver=Microsoft Access Driver (*.mdb);dbq=$mdb"; my $dbh = DBI->connect($dsn, '', '', { PrintError => 0, RaiseError => 1, }); my $query = q( SELECT Windows1252Encoding, UnicodeCodePoint, UnicodeCharacter, UnicodeUTF8Encoding, UnicodeName FROM Characters ORDER BY Windows1252Encoding ); my $sth = $dbh->prepare($query); $sth->execute(); local $OUTPUT_RECORD_SEPARATOR = "\n"; local $OUTPUT_FIELD_SEPARATOR = "\t"; while (my $row = $sth->fetchrow_hashref()) { $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); print $row->{Windows1252Encoding}, $row->{UnicodeCodePoint}, $row->{UnicodeCharacter}, $row->{UnicodeUTF8Encoding}, $row->{UnicodeName}, data_string_desc($row->{UnicodeCharacter}); } $dbh->disconnect(); exit 0;

    And here's the data in the Characters table of the Characters.mdb database. All of this data is in Text (255) columns in the Characters table.

    0x80	U+20AC	€	E2 82 AC	EURO SIGN
    0x82	U+201A	‚	E2 80 9A	SINGLE LOW-9 QUOTATION MARK
    0x83	U+0192	ƒ	C6 92	LATIN SMALL LETTER F WITH HOOK
    0x84	U+201E	„	E2 80 9E	DOUBLE LOW-9 QUOTATION MARK
    0x85	U+2026	…	E2 80 A6	HORIZONTAL ELLIPSIS
    0x86	U+2020	†	E2 80 A0	DAGGER
    0x87	U+2021	‡	E2 80 A1	DOUBLE DAGGER
    0x88	U+02C6	ˆ	CB 86	MODIFIER LETTER CIRCUMFLEX ACCENT
    0x89	U+2030	‰	E2 80 B0	PER MILLE SIGN
    0x8A	U+0160	Š	C5 A0	LATIN CAPITAL LETTER S WITH CARON
    0x8B	U+2039	‹	E2 80 B9	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x8C	U+0152	Œ	C5 92	LATIN CAPITAL LIGATURE OE
    0x8E	U+017D	Ž	C5 BD	LATIN CAPITAL LETTER Z WITH CARON
    0x91	U+2018	‘	E2 80 98	LEFT SINGLE QUOTATION MARK
    0x92	U+2019	’	E2 80 99	RIGHT SINGLE QUOTATION MARK
    0x93	U+201C	“	E2 80 9C	LEFT DOUBLE QUOTATION MARK
    0x94	U+201D	”	E2 80 9D	RIGHT DOUBLE QUOTATION MARK
    0x95	U+2022	•	E2 80 A2	BULLET
    0x96	U+2013	–	E2 80 93	EN DASH
    0x97	U+2014	—	E2 80 94	EM DASH
    0x98	U+02DC	˜	CB 9C	SMALL TILDE
    0x99	U+2122	™	E2 84 A2	TRADE MARK SIGN
    0x9A	U+0161	š	C5 A1	LATIN SMALL LETTER S WITH CARON
    0x9B	U+203A	›	E2 80 BA	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x9C	U+0153	œ	C5 93	LATIN SMALL LIGATURE OE
    0x9E	U+017E	ž	C5 BE	LATIN SMALL LETTER Z WITH CARON
    0x9F	U+0178	Ÿ	C5 B8	LATIN CAPITAL LETTER Y WITH DIAERESIS
    0xA1	U+00A1	¡	C2 A1	INVERTED EXCLAMATION MARK
    0xA2	U+00A2	¢	C2 A2	CENT SIGN
    0xA3	U+00A3	£	C2 A3	POUND SIGN
    0xA4	U+00A4	¤	C2 A4	CURRENCY SIGN
    0xA5	U+00A5	¥	C2 A5	YEN SIGN
    0xA6	U+00A6	¦	C2 A6	BROKEN BAR
    0xA7	U+00A7	§	C2 A7	SECTION SIGN
    0xA8	U+00A8	¨	C2 A8	DIAERESIS
    0xA9	U+00A9	©	C2 A9	COPYRIGHT SIGN
    0xAA	U+00AA	ª	C2 AA	FEMININE ORDINAL INDICATOR
    0xAB	U+00AB	«	C2 AB	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    0xAC	U+00AC	¬	C2 AC	NOT SIGN
    0xAD	U+00AD	­	C2 AD	SOFT HYPHEN
    0xAE	U+00AE	®	C2 AE	REGISTERED SIGN
    0xAF	U+00AF	¯	C2 AF	MACRON
    0xB0	U+00B0	°	C2 B0	DEGREE SIGN
    0xB1	U+00B1	±	C2 B1	PLUS-MINUS SIGN
    0xB2	U+00B2	²	C2 B2	SUPERSCRIPT TWO
    0xB3	U+00B3	³	C2 B3	SUPERSCRIPT THREE
    0xB4	U+00B4	´	C2 B4	ACUTE ACCENT
    0xB5	U+00B5	µ	C2 B5	MICRO SIGN
    0xB6	U+00B6	¶	C2 B6	PILCROW SIGN
    0xB7	U+00B7	·	C2 B7	MIDDLE DOT
    0xB8	U+00B8	¸	C2 B8	CEDILLA
    0xB9	U+00B9	¹	C2 B9	SUPERSCRIPT ONE
    0xBA	U+00BA	º	C2 BA	MASCULINE ORDINAL INDICATOR
    0xBB	U+00BB	»	C2 BB	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0xBC	U+00BC	¼	C2 BC	VULGAR FRACTION ONE QUARTER
    0xBD	U+00BD	½	C2 BD	VULGAR FRACTION ONE HALF
    0xBE	U+00BE	¾	C2 BE	VULGAR FRACTION THREE QUARTERS
    0xBF	U+00BF	¿	C2 BF	INVERTED QUESTION MARK
    0xC0	U+00C0	À	C3 80	LATIN CAPITAL LETTER A WITH GRAVE
    0xC1	U+00C1	Á	C3 81	LATIN CAPITAL LETTER A WITH ACUTE
    0xC2	U+00C2	Â	C3 82	LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    0xC3	U+00C3	Ã	C3 83	LATIN CAPITAL LETTER A WITH TILDE
    0xC4	U+00C4	Ä	C3 84	LATIN CAPITAL LETTER A WITH DIAERESIS
    0xC5	U+00C5	Å	C3 85	LATIN CAPITAL LETTER A WITH RING ABOVE
    0xC6	U+00C6	Æ	C3 86	LATIN CAPITAL LETTER AE
    0xC7	U+00C7	Ç	C3 87	LATIN CAPITAL LETTER C WITH CEDILLA
    0xC8	U+00C8	È	C3 88	LATIN CAPITAL LETTER E WITH GRAVE
    0xC9	U+00C9	É	C3 89	LATIN CAPITAL LETTER E WITH ACUTE
    0xCA	U+00CA	Ê	C3 8A	LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    0xCB	U+00CB	Ë	C3 8B	LATIN CAPITAL LETTER E WITH DIAERESIS
    0xCC	U+00CC	Ì	C3 8C	LATIN CAPITAL LETTER I WITH GRAVE
    0xCD	U+00CD	Í	C3 8D	LATIN CAPITAL LETTER I WITH ACUTE
    0xCE	U+00CE	Î	C3 8E	LATIN CAPITAL LETTER I WITH CIRCUMFLEX
    0xCF	U+00CF	Ï	C3 8F	LATIN CAPITAL LETTER I WITH DIAERESIS
    0xD0	U+00D0	Ð	C3 90	LATIN CAPITAL LETTER ETH
    0xD1	U+00D1	Ñ	C3 91	LATIN CAPITAL LETTER N WITH TILDE
    0xD2	U+00D2	Ò	C3 92	LATIN CAPITAL LETTER O WITH GRAVE
    0xD3	U+00D3	Ó	C3 93	LATIN CAPITAL LETTER O WITH ACUTE
    0xD4	U+00D4	Ô	C3 94	LATIN CAPITAL LETTER O WITH CIRCUMFLEX
    0xD5	U+00D5	Õ	C3 95	LATIN CAPITAL LETTER O WITH TILDE
    0xD6	U+00D6	Ö	C3 96	LATIN CAPITAL LETTER O WITH DIAERESIS
    0xD7	U+00D7	×	C3 97	MULTIPLICATION SIGN
    0xD8	U+00D8	Ø	C3 98	LATIN CAPITAL LETTER O WITH STROKE
    0xD9	U+00D9	Ù	C3 99	LATIN CAPITAL LETTER U WITH GRAVE
    0xDA	U+00DA	Ú	C3 9A	LATIN CAPITAL LETTER U WITH ACUTE
    0xDB	U+00DB	Û	C3 9B	LATIN CAPITAL LETTER U WITH CIRCUMFLEX
    0xDC	U+00DC	Ü	C3 9C	LATIN CAPITAL LETTER U WITH DIAERESIS
    0xDD	U+00DD	Ý	C3 9D	LATIN CAPITAL LETTER Y WITH ACUTE
    0xDE	U+00DE	Þ	C3 9E	LATIN CAPITAL LETTER THORN
    0xDF	U+00DF	ß	C3 9F	LATIN SMALL LETTER SHARP S
    0xE0	U+00E0	à	C3 A0	LATIN SMALL LETTER A WITH GRAVE
    0xE1	U+00E1	á	C3 A1	LATIN SMALL LETTER A WITH ACUTE
    0xE2	U+00E2	â	C3 A2	LATIN SMALL LETTER A WITH CIRCUMFLEX
    0xE3	U+00E3	ã	C3 A3	LATIN SMALL LETTER A WITH TILDE
    0xE4	U+00E4	ä	C3 A4	LATIN SMALL LETTER A WITH DIAERESIS
    0xE5	U+00E5	å	C3 A5	LATIN SMALL LETTER A WITH RING ABOVE
    0xE6	U+00E6	æ	C3 A6	LATIN SMALL LETTER AE
    0xE7	U+00E7	ç	C3 A7	LATIN SMALL LETTER C WITH CEDILLA
    0xE8	U+00E8	è	C3 A8	LATIN SMALL LETTER E WITH GRAVE
    0xE9	U+00E9	é	C3 A9	LATIN SMALL LETTER E WITH ACUTE
    0xEA	U+00EA	ê	C3 AA	LATIN SMALL LETTER E WITH CIRCUMFLEX
    0xEB	U+00EB	ë	C3 AB	LATIN SMALL LETTER E WITH DIAERESIS
    0xEC	U+00EC	ì	C3 AC	LATIN SMALL LETTER I WITH GRAVE
    0xED	U+00ED	í	C3 AD	LATIN SMALL LETTER I WITH ACUTE
    0xEE	U+00EE	î	C3 AE	LATIN SMALL LETTER I WITH CIRCUMFLEX
    0xEF	U+00EF	ï	C3 AF	LATIN SMALL LETTER I WITH DIAERESIS
    0xF0	U+00F0	ð	C3 B0	LATIN SMALL LETTER ETH
    0xF1	U+00F1	ñ	C3 B1	LATIN SMALL LETTER N WITH TILDE
    0xF2	U+00F2	ò	C3 B2	LATIN SMALL LETTER O WITH GRAVE
    0xF3	U+00F3	ó	C3 B3	LATIN SMALL LETTER O WITH ACUTE
    0xF4	U+00F4	ô	C3 B4	LATIN SMALL LETTER O WITH CIRCUMFLEX
    0xF5	U+00F5	õ	C3 B5	LATIN SMALL LETTER O WITH TILDE
    0xF6	U+00F6	ö	C3 B6	LATIN SMALL LETTER O WITH DIAERESIS
    0xF7	U+00F7	÷	C3 B7	DIVISION SIGN
    0xF8	U+00F8	ø	C3 B8	LATIN SMALL LETTER O WITH STROKE
    0xF9	U+00F9	ù	C3 B9	LATIN SMALL LETTER U WITH GRAVE
    0xFA	U+00FA	ú	C3 BA	LATIN SMALL LETTER U WITH ACUTE
    0xFB	U+00FB	û	C3 BB	LATIN SMALL LETTER U WITH CIRCUMFLEX
    0xFC	U+00FC	ü	C3 BC	LATIN SMALL LETTER U WITH DIAERESIS
    0xFD	U+00FD	ý	C3 BD	LATIN SMALL LETTER Y WITH ACUTE
    0xFE	U+00FE	þ	C3 BE	LATIN SMALL LETTER THORN
    0xFF	U+00FF	ÿ	C3 BF	LATIN SMALL LETTER Y WITH DIAERESIS
    

    I get the exact same output whether I decode the input or not. The statement…

    $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter});

    …is a no-op. The string returned by the DBI utility function data_string_desc() is always 'UTF8 on, non-ASCII, 1 characters 2 [or 3] bytes', regardless of whether or not I decode the queried text.

    So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!) queried from a Text datatype column in a Microsoft Office Access database that stores all data in Unicode?

      On Windows (and UNIX if built for unicode) DBD::ODBC reads UCS2 encoded data from the MS Access database though the ODBC driver manager and MS Access ODBC Driver. DBD::ODBC then decodes that and reencodes it UTF-8 such that Perl sees the strings as unicode. As a result, I believe the decode has been done for you.

        Thank you, ++mje! This is very helpful to know. And it's authoritative, too, as it's coming straight from the maintainer of DBD::ODBC.

        Your post prompted me to reread the documentation of DBD::ODBC's Unicode support more carefully. Among the wealth of detailed information about Unicode and various different drivers for different RDBMSes, the documentation does state:

        DBD::ODBC uses the wide character versions of the ODBC API and the SQL_WCHAR ODBC type to support unicode in Perl.
        Wide characters returned from the ODBC driver will be converted to UTF-8 and the perl scalars will have the utf8 flag set (by using sv_utf8_decode).
        perl scalars which are UTF-8 and are sent through the ODBC API will be converted to UTF-16 and passed to the ODBC wide APIs or signalled as SQL_WCHARs (e.g., in the case of bound columns).

        I think it might be helpful to have an entry in the DBD::ODBC FAQ like "How Do I Handle Unicode Text With MS Access?" that simply and plainly explains that, mostly, it should Just Work. (Shouldn't it?)

      So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!)

      Good, cause you must not decode strings of text. Decoding is the process of getting text from "binary data".

      The statement $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); is a no-op.

      ug. The Unicode bug :(

      use strict; use warnings; use Test::More tests => 4; use Encode qw( encode_utf8 decode_utf8 ); my $x1 = chr(0xE9); my $y1 = encode_utf8($x1); my $z = encode_utf8($y1); my $y2 = decode_utf8($z); my $x2 = decode_utf8($y2); is($y2, $y1); is($x2, $x1); # Fails isnt($x1, $y1); isnt($y1, $z); 1;
      1..4 ok 1 not ok 2 # Failed test at a.pl line 15. # got: 'é' # expected: 'é' ok 3 ok 4 # Looks like you failed 1 test of 4.

      This is better:

      use strict; use warnings; use Test::More tests => 4; my $x1 = chr(0xE9); utf8::encode( my $y1 = $x1 ); utf8::encode( my $z = $y1 ); utf8::decode( my $y2 = $z ); utf8::decode( my $x2 = $y2 ); is($y2, $y1); is($x2, $x1); isnt($x1, $y1); isnt($y1, $z); 1;
      1..4 ok 1 ok 2 ok 3 ok 4
        You see? 50% of your code was dealing with encoding. That's "wrong". Programming shouldn't be like that.

        I have been preached to over the years, over and over and over, that Perl is great and "does the right thing" and it always has, until now. Now Perl is completely obliterated by Unicode/UTF8|16 in the "doing the right thing" department. Now suddenly, instead of following process logic, I have to be worrying about what format strings are?! If I have to worry about these things then I might as well go back to C or assembly and at least get a performance boost for my efforts.

        But yes ikegami, you are correct, it IS possible to work with UTF8 just somewhat impractical, and I would put forth the theory that the reason no definitive guide to working with UTF8 and Unicode in Perl exists is because by the time a 100% no-fail solution is written up the author is forced to realize that the solution is too damaging to the reputation of the language.

        Tosh