in reply to Re^3: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right
I realize this is the conventional wisdom among the cognoscenti and what some of the Perl documentation tells you to do, but it flies in the face of both the reasonable expectations of many Perl programmers (tosh and I, for example) and a lot of existing documentation. How often do you see a Perl command-line script in a book or a code example in a perldoc page that decodes its input and encodes its output? Not very often.
I've been trying to sort out an elusive problem I'm having with Unicode text queried from a Microsoft Office Access database. To this end, I wrote a trivial test script:
#!perl # # Characters.pl use strict; use warnings; use open qw( :utf8 :std ); use DBI qw( data_string_desc ); use Encode qw( decode_utf8 ); use English qw( -no_match_vars ); my $mdb = 'Characters.mdb'; my $dsn = "dbi:ODBC:driver=Microsoft Access Driver (*.mdb);dbq=$mdb"; my $dbh = DBI->connect($dsn, '', '', { PrintError => 0, RaiseError => 1, }); my $query = q( SELECT Windows1252Encoding, UnicodeCodePoint, UnicodeCharacter, UnicodeUTF8Encoding, UnicodeName FROM Characters ORDER BY Windows1252Encoding ); my $sth = $dbh->prepare($query); $sth->execute(); local $OUTPUT_RECORD_SEPARATOR = "\n"; local $OUTPUT_FIELD_SEPARATOR = "\t"; while (my $row = $sth->fetchrow_hashref()) { $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); print $row->{Windows1252Encoding}, $row->{UnicodeCodePoint}, $row->{UnicodeCharacter}, $row->{UnicodeUTF8Encoding}, $row->{UnicodeName}, data_string_desc($row->{UnicodeCharacter}); } $dbh->disconnect(); exit 0;
And here's the data in the Characters table of the Characters.mdb database. All of this data is in Text (255) columns in the Characters table.
0x80 U+20AC € E2 82 AC EURO SIGN 0x82 U+201A ‚ E2 80 9A SINGLE LOW-9 QUOTATION MARK 0x83 U+0192 ƒ C6 92 LATIN SMALL LETTER F WITH HOOK 0x84 U+201E „ E2 80 9E DOUBLE LOW-9 QUOTATION MARK 0x85 U+2026 … E2 80 A6 HORIZONTAL ELLIPSIS 0x86 U+2020 † E2 80 A0 DAGGER 0x87 U+2021 ‡ E2 80 A1 DOUBLE DAGGER 0x88 U+02C6 ˆ CB 86 MODIFIER LETTER CIRCUMFLEX ACCENT 0x89 U+2030 ‰ E2 80 B0 PER MILLE SIGN 0x8A U+0160 Š C5 A0 LATIN CAPITAL LETTER S WITH CARON 0x8B U+2039 ‹ E2 80 B9 SINGLE LEFT-POINTING ANGLE QUOTATION MARK 0x8C U+0152 Œ C5 92 LATIN CAPITAL LIGATURE OE 0x8E U+017D Ž C5 BD LATIN CAPITAL LETTER Z WITH CARON 0x91 U+2018 ‘ E2 80 98 LEFT SINGLE QUOTATION MARK 0x92 U+2019 ’ E2 80 99 RIGHT SINGLE QUOTATION MARK 0x93 U+201C “ E2 80 9C LEFT DOUBLE QUOTATION MARK 0x94 U+201D ” E2 80 9D RIGHT DOUBLE QUOTATION MARK 0x95 U+2022 • E2 80 A2 BULLET 0x96 U+2013 – E2 80 93 EN DASH 0x97 U+2014 — E2 80 94 EM DASH 0x98 U+02DC ˜ CB 9C SMALL TILDE 0x99 U+2122 ™ E2 84 A2 TRADE MARK SIGN 0x9A U+0161 š C5 A1 LATIN SMALL LETTER S WITH CARON 0x9B U+203A › E2 80 BA SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 0x9C U+0153 œ C5 93 LATIN SMALL LIGATURE OE 0x9E U+017E ž C5 BE LATIN SMALL LETTER Z WITH CARON 0x9F U+0178 Ÿ C5 B8 LATIN CAPITAL LETTER Y WITH DIAERESIS 0xA1 U+00A1 ¡ C2 A1 INVERTED EXCLAMATION MARK 0xA2 U+00A2 ¢ C2 A2 CENT SIGN 0xA3 U+00A3 £ C2 A3 POUND SIGN 0xA4 U+00A4 ¤ C2 A4 CURRENCY SIGN 0xA5 U+00A5 ¥ C2 A5 YEN SIGN 0xA6 U+00A6 ¦ C2 A6 BROKEN BAR 0xA7 U+00A7 § C2 A7 SECTION SIGN 0xA8 U+00A8 ¨ C2 A8 DIAERESIS 0xA9 U+00A9 © C2 A9 COPYRIGHT SIGN 0xAA U+00AA ª C2 AA FEMININE ORDINAL INDICATOR 0xAB U+00AB « C2 AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK 0xAC U+00AC ¬ C2 AC NOT SIGN 0xAD U+00AD C2 AD SOFT HYPHEN 0xAE U+00AE ® C2 AE REGISTERED SIGN 0xAF U+00AF ¯ C2 AF MACRON 0xB0 U+00B0 ° C2 B0 DEGREE SIGN 0xB1 U+00B1 ± C2 B1 PLUS-MINUS SIGN 0xB2 U+00B2 ² C2 B2 SUPERSCRIPT TWO 0xB3 U+00B3 ³ C2 B3 SUPERSCRIPT THREE 0xB4 U+00B4 ´ C2 B4 ACUTE ACCENT 0xB5 U+00B5 µ C2 B5 MICRO SIGN 0xB6 U+00B6 ¶ C2 B6 PILCROW SIGN 0xB7 U+00B7 · C2 B7 MIDDLE DOT 0xB8 U+00B8 ¸ C2 B8 CEDILLA 0xB9 U+00B9 ¹ C2 B9 SUPERSCRIPT ONE 0xBA U+00BA º C2 BA MASCULINE ORDINAL INDICATOR 0xBB U+00BB » C2 BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK 0xBC U+00BC ¼ C2 BC VULGAR FRACTION ONE QUARTER 0xBD U+00BD ½ C2 BD VULGAR FRACTION ONE HALF 0xBE U+00BE ¾ C2 BE VULGAR FRACTION THREE QUARTERS 0xBF U+00BF ¿ C2 BF INVERTED QUESTION MARK 0xC0 U+00C0 À C3 80 LATIN CAPITAL LETTER A WITH GRAVE 0xC1 U+00C1 Á C3 81 LATIN CAPITAL LETTER A WITH ACUTE 0xC2 U+00C2  C3 82 LATIN CAPITAL LETTER A WITH CIRCUMFLEX 0xC3 U+00C3 à C3 83 LATIN CAPITAL LETTER A WITH TILDE 0xC4 U+00C4 Ä C3 84 LATIN CAPITAL LETTER A WITH DIAERESIS 0xC5 U+00C5 Å C3 85 LATIN CAPITAL LETTER A WITH RING ABOVE 0xC6 U+00C6 Æ C3 86 LATIN CAPITAL LETTER AE 0xC7 U+00C7 Ç C3 87 LATIN CAPITAL LETTER C WITH CEDILLA 0xC8 U+00C8 È C3 88 LATIN CAPITAL LETTER E WITH GRAVE 0xC9 U+00C9 É C3 89 LATIN CAPITAL LETTER E WITH ACUTE 0xCA U+00CA Ê C3 8A LATIN CAPITAL LETTER E WITH CIRCUMFLEX 0xCB U+00CB Ë C3 8B LATIN CAPITAL LETTER E WITH DIAERESIS 0xCC U+00CC Ì C3 8C LATIN CAPITAL LETTER I WITH GRAVE 0xCD U+00CD Í C3 8D LATIN CAPITAL LETTER I WITH ACUTE 0xCE U+00CE Î C3 8E LATIN CAPITAL LETTER I WITH CIRCUMFLEX 0xCF U+00CF Ï C3 8F LATIN CAPITAL LETTER I WITH DIAERESIS 0xD0 U+00D0 Ð C3 90 LATIN CAPITAL LETTER ETH 0xD1 U+00D1 Ñ C3 91 LATIN CAPITAL LETTER N WITH TILDE 0xD2 U+00D2 Ò C3 92 LATIN CAPITAL LETTER O WITH GRAVE 0xD3 U+00D3 Ó C3 93 LATIN CAPITAL LETTER O WITH ACUTE 0xD4 U+00D4 Ô C3 94 LATIN CAPITAL LETTER O WITH CIRCUMFLEX 0xD5 U+00D5 Õ C3 95 LATIN CAPITAL LETTER O WITH TILDE 0xD6 U+00D6 Ö C3 96 LATIN CAPITAL LETTER O WITH DIAERESIS 0xD7 U+00D7 × C3 97 MULTIPLICATION SIGN 0xD8 U+00D8 Ø C3 98 LATIN CAPITAL LETTER O WITH STROKE 0xD9 U+00D9 Ù C3 99 LATIN CAPITAL LETTER U WITH GRAVE 0xDA U+00DA Ú C3 9A LATIN CAPITAL LETTER U WITH ACUTE 0xDB U+00DB Û C3 9B LATIN CAPITAL LETTER U WITH CIRCUMFLEX 0xDC U+00DC Ü C3 9C LATIN CAPITAL LETTER U WITH DIAERESIS 0xDD U+00DD Ý C3 9D LATIN CAPITAL LETTER Y WITH ACUTE 0xDE U+00DE Þ C3 9E LATIN CAPITAL LETTER THORN 0xDF U+00DF ß C3 9F LATIN SMALL LETTER SHARP S 0xE0 U+00E0 à C3 A0 LATIN SMALL LETTER A WITH GRAVE 0xE1 U+00E1 á C3 A1 LATIN SMALL LETTER A WITH ACUTE 0xE2 U+00E2 â C3 A2 LATIN SMALL LETTER A WITH CIRCUMFLEX 0xE3 U+00E3 ã C3 A3 LATIN SMALL LETTER A WITH TILDE 0xE4 U+00E4 ä C3 A4 LATIN SMALL LETTER A WITH DIAERESIS 0xE5 U+00E5 å C3 A5 LATIN SMALL LETTER A WITH RING ABOVE 0xE6 U+00E6 æ C3 A6 LATIN SMALL LETTER AE 0xE7 U+00E7 ç C3 A7 LATIN SMALL LETTER C WITH CEDILLA 0xE8 U+00E8 è C3 A8 LATIN SMALL LETTER E WITH GRAVE 0xE9 U+00E9 é C3 A9 LATIN SMALL LETTER E WITH ACUTE 0xEA U+00EA ê C3 AA LATIN SMALL LETTER E WITH CIRCUMFLEX 0xEB U+00EB ë C3 AB LATIN SMALL LETTER E WITH DIAERESIS 0xEC U+00EC ì C3 AC LATIN SMALL LETTER I WITH GRAVE 0xED U+00ED í C3 AD LATIN SMALL LETTER I WITH ACUTE 0xEE U+00EE î C3 AE LATIN SMALL LETTER I WITH CIRCUMFLEX 0xEF U+00EF ï C3 AF LATIN SMALL LETTER I WITH DIAERESIS 0xF0 U+00F0 ð C3 B0 LATIN SMALL LETTER ETH 0xF1 U+00F1 ñ C3 B1 LATIN SMALL LETTER N WITH TILDE 0xF2 U+00F2 ò C3 B2 LATIN SMALL LETTER O WITH GRAVE 0xF3 U+00F3 ó C3 B3 LATIN SMALL LETTER O WITH ACUTE 0xF4 U+00F4 ô C3 B4 LATIN SMALL LETTER O WITH CIRCUMFLEX 0xF5 U+00F5 õ C3 B5 LATIN SMALL LETTER O WITH TILDE 0xF6 U+00F6 ö C3 B6 LATIN SMALL LETTER O WITH DIAERESIS 0xF7 U+00F7 ÷ C3 B7 DIVISION SIGN 0xF8 U+00F8 ø C3 B8 LATIN SMALL LETTER O WITH STROKE 0xF9 U+00F9 ù C3 B9 LATIN SMALL LETTER U WITH GRAVE 0xFA U+00FA ú C3 BA LATIN SMALL LETTER U WITH ACUTE 0xFB U+00FB û C3 BB LATIN SMALL LETTER U WITH CIRCUMFLEX 0xFC U+00FC ü C3 BC LATIN SMALL LETTER U WITH DIAERESIS 0xFD U+00FD ý C3 BD LATIN SMALL LETTER Y WITH ACUTE 0xFE U+00FE þ C3 BE LATIN SMALL LETTER THORN 0xFF U+00FF ÿ C3 BF LATIN SMALL LETTER Y WITH DIAERESIS
I get the exact same output whether I decode the input or not. The statement…
$row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter});
…is a no-op. The string returned by the DBI utility function data_string_desc() is always 'UTF8 on, non-ASCII, 1 characters 2 [or 3] bytes', regardless of whether or not I decode the queried text.
So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!) queried from a Text datatype column in a Microsoft Office Access database that stores all data in Unicode?
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^5: Mugged by UTF8, this CANNOT be right
by mje (Curate) on Jan 27, 2011 at 10:30 UTC | |
by Jim (Curate) on Jan 27, 2011 at 16:54 UTC | |
by mje (Curate) on Jan 27, 2011 at 17:29 UTC | |
|
Re^5: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 27, 2011 at 08:44 UTC | |
by tosh (Scribe) on Jan 27, 2011 at 09:39 UTC | |
by ikegami (Patriarch) on Jan 27, 2011 at 18:17 UTC |