I realize this is the conventional wisdom among the cognoscenti and what some of the Perl documentation tells you to do, but it flies in the face of both the reasonable expectations of many Perl programmers (tosh and I, for example) and a lot of existing documentation. How often do you see a Perl command-line script in a book or a code example in a perldoc page that decodes its input and encodes its output? Not very often.

I've been trying to sort out an elusive problem I'm having with Unicode text queried from a Microsoft Office Access database. To this end, I wrote a trivial test script:

#!perl # # Characters.pl use strict; use warnings; use open qw( :utf8 :std ); use DBI qw( data_string_desc ); use Encode qw( decode_utf8 ); use English qw( -no_match_vars ); my $mdb = 'Characters.mdb'; my $dsn = "dbi:ODBC:driver=Microsoft Access Driver (*.mdb);dbq=$mdb"; my $dbh = DBI->connect($dsn, '', '', { PrintError => 0, RaiseError => 1, }); my $query = q( SELECT Windows1252Encoding, UnicodeCodePoint, UnicodeCharacter, UnicodeUTF8Encoding, UnicodeName FROM Characters ORDER BY Windows1252Encoding ); my $sth = $dbh->prepare($query); $sth->execute(); local $OUTPUT_RECORD_SEPARATOR = "\n"; local $OUTPUT_FIELD_SEPARATOR = "\t"; while (my $row = $sth->fetchrow_hashref()) { $row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter}); print $row->{Windows1252Encoding}, $row->{UnicodeCodePoint}, $row->{UnicodeCharacter}, $row->{UnicodeUTF8Encoding}, $row->{UnicodeName}, data_string_desc($row->{UnicodeCharacter}); } $dbh->disconnect(); exit 0;

And here's the data in the Characters table of the Characters.mdb database. All of this data is in Text (255) columns in the Characters table.

0x80	U+20AC	€	E2 82 AC	EURO SIGN
0x82	U+201A	‚	E2 80 9A	SINGLE LOW-9 QUOTATION MARK
0x83	U+0192	ƒ	C6 92	LATIN SMALL LETTER F WITH HOOK
0x84	U+201E	„	E2 80 9E	DOUBLE LOW-9 QUOTATION MARK
0x85	U+2026	…	E2 80 A6	HORIZONTAL ELLIPSIS
0x86	U+2020	†	E2 80 A0	DAGGER
0x87	U+2021	‡	E2 80 A1	DOUBLE DAGGER
0x88	U+02C6	ˆ	CB 86	MODIFIER LETTER CIRCUMFLEX ACCENT
0x89	U+2030	‰	E2 80 B0	PER MILLE SIGN
0x8A	U+0160	Š	C5 A0	LATIN CAPITAL LETTER S WITH CARON
0x8B	U+2039	‹	E2 80 B9	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C	U+0152	Œ	C5 92	LATIN CAPITAL LIGATURE OE
0x8E	U+017D	Ž	C5 BD	LATIN CAPITAL LETTER Z WITH CARON
0x91	U+2018	‘	E2 80 98	LEFT SINGLE QUOTATION MARK
0x92	U+2019	’	E2 80 99	RIGHT SINGLE QUOTATION MARK
0x93	U+201C	“	E2 80 9C	LEFT DOUBLE QUOTATION MARK
0x94	U+201D	”	E2 80 9D	RIGHT DOUBLE QUOTATION MARK
0x95	U+2022	•	E2 80 A2	BULLET
0x96	U+2013	–	E2 80 93	EN DASH
0x97	U+2014	—	E2 80 94	EM DASH
0x98	U+02DC	˜	CB 9C	SMALL TILDE
0x99	U+2122	™	E2 84 A2	TRADE MARK SIGN
0x9A	U+0161	š	C5 A1	LATIN SMALL LETTER S WITH CARON
0x9B	U+203A	›	E2 80 BA	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C	U+0153	œ	C5 93	LATIN SMALL LIGATURE OE
0x9E	U+017E	ž	C5 BE	LATIN SMALL LETTER Z WITH CARON
0x9F	U+0178	Ÿ	C5 B8	LATIN CAPITAL LETTER Y WITH DIAERESIS
0xA1	U+00A1	¡	C2 A1	INVERTED EXCLAMATION MARK
0xA2	U+00A2	¢	C2 A2	CENT SIGN
0xA3	U+00A3	£	C2 A3	POUND SIGN
0xA4	U+00A4	¤	C2 A4	CURRENCY SIGN
0xA5	U+00A5	¥	C2 A5	YEN SIGN
0xA6	U+00A6	¦	C2 A6	BROKEN BAR
0xA7	U+00A7	§	C2 A7	SECTION SIGN
0xA8	U+00A8	¨	C2 A8	DIAERESIS
0xA9	U+00A9	©	C2 A9	COPYRIGHT SIGN
0xAA	U+00AA	ª	C2 AA	FEMININE ORDINAL INDICATOR
0xAB	U+00AB	«	C2 AB	LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
0xAC	U+00AC	¬	C2 AC	NOT SIGN
0xAD	U+00AD	­	C2 AD	SOFT HYPHEN
0xAE	U+00AE	®	C2 AE	REGISTERED SIGN
0xAF	U+00AF	¯	C2 AF	MACRON
0xB0	U+00B0	°	C2 B0	DEGREE SIGN
0xB1	U+00B1	±	C2 B1	PLUS-MINUS SIGN
0xB2	U+00B2	²	C2 B2	SUPERSCRIPT TWO
0xB3	U+00B3	³	C2 B3	SUPERSCRIPT THREE
0xB4	U+00B4	´	C2 B4	ACUTE ACCENT
0xB5	U+00B5	µ	C2 B5	MICRO SIGN
0xB6	U+00B6	¶	C2 B6	PILCROW SIGN
0xB7	U+00B7	·	C2 B7	MIDDLE DOT
0xB8	U+00B8	¸	C2 B8	CEDILLA
0xB9	U+00B9	¹	C2 B9	SUPERSCRIPT ONE
0xBA	U+00BA	º	C2 BA	MASCULINE ORDINAL INDICATOR
0xBB	U+00BB	»	C2 BB	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
0xBC	U+00BC	¼	C2 BC	VULGAR FRACTION ONE QUARTER
0xBD	U+00BD	½	C2 BD	VULGAR FRACTION ONE HALF
0xBE	U+00BE	¾	C2 BE	VULGAR FRACTION THREE QUARTERS
0xBF	U+00BF	¿	C2 BF	INVERTED QUESTION MARK
0xC0	U+00C0	À	C3 80	LATIN CAPITAL LETTER A WITH GRAVE
0xC1	U+00C1	Á	C3 81	LATIN CAPITAL LETTER A WITH ACUTE
0xC2	U+00C2	Â	C3 82	LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0xC3	U+00C3	Ã	C3 83	LATIN CAPITAL LETTER A WITH TILDE
0xC4	U+00C4	Ä	C3 84	LATIN CAPITAL LETTER A WITH DIAERESIS
0xC5	U+00C5	Å	C3 85	LATIN CAPITAL LETTER A WITH RING ABOVE
0xC6	U+00C6	Æ	C3 86	LATIN CAPITAL LETTER AE
0xC7	U+00C7	Ç	C3 87	LATIN CAPITAL LETTER C WITH CEDILLA
0xC8	U+00C8	È	C3 88	LATIN CAPITAL LETTER E WITH GRAVE
0xC9	U+00C9	É	C3 89	LATIN CAPITAL LETTER E WITH ACUTE
0xCA	U+00CA	Ê	C3 8A	LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0xCB	U+00CB	Ë	C3 8B	LATIN CAPITAL LETTER E WITH DIAERESIS
0xCC	U+00CC	Ì	C3 8C	LATIN CAPITAL LETTER I WITH GRAVE
0xCD	U+00CD	Í	C3 8D	LATIN CAPITAL LETTER I WITH ACUTE
0xCE	U+00CE	Î	C3 8E	LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0xCF	U+00CF	Ï	C3 8F	LATIN CAPITAL LETTER I WITH DIAERESIS
0xD0	U+00D0	Ð	C3 90	LATIN CAPITAL LETTER ETH
0xD1	U+00D1	Ñ	C3 91	LATIN CAPITAL LETTER N WITH TILDE
0xD2	U+00D2	Ò	C3 92	LATIN CAPITAL LETTER O WITH GRAVE
0xD3	U+00D3	Ó	C3 93	LATIN CAPITAL LETTER O WITH ACUTE
0xD4	U+00D4	Ô	C3 94	LATIN CAPITAL LETTER O WITH CIRCUMFLEX
0xD5	U+00D5	Õ	C3 95	LATIN CAPITAL LETTER O WITH TILDE
0xD6	U+00D6	Ö	C3 96	LATIN CAPITAL LETTER O WITH DIAERESIS
0xD7	U+00D7	×	C3 97	MULTIPLICATION SIGN
0xD8	U+00D8	Ø	C3 98	LATIN CAPITAL LETTER O WITH STROKE
0xD9	U+00D9	Ù	C3 99	LATIN CAPITAL LETTER U WITH GRAVE
0xDA	U+00DA	Ú	C3 9A	LATIN CAPITAL LETTER U WITH ACUTE
0xDB	U+00DB	Û	C3 9B	LATIN CAPITAL LETTER U WITH CIRCUMFLEX
0xDC	U+00DC	Ü	C3 9C	LATIN CAPITAL LETTER U WITH DIAERESIS
0xDD	U+00DD	Ý	C3 9D	LATIN CAPITAL LETTER Y WITH ACUTE
0xDE	U+00DE	Þ	C3 9E	LATIN CAPITAL LETTER THORN
0xDF	U+00DF	ß	C3 9F	LATIN SMALL LETTER SHARP S
0xE0	U+00E0	à	C3 A0	LATIN SMALL LETTER A WITH GRAVE
0xE1	U+00E1	á	C3 A1	LATIN SMALL LETTER A WITH ACUTE
0xE2	U+00E2	â	C3 A2	LATIN SMALL LETTER A WITH CIRCUMFLEX
0xE3	U+00E3	ã	C3 A3	LATIN SMALL LETTER A WITH TILDE
0xE4	U+00E4	ä	C3 A4	LATIN SMALL LETTER A WITH DIAERESIS
0xE5	U+00E5	å	C3 A5	LATIN SMALL LETTER A WITH RING ABOVE
0xE6	U+00E6	æ	C3 A6	LATIN SMALL LETTER AE
0xE7	U+00E7	ç	C3 A7	LATIN SMALL LETTER C WITH CEDILLA
0xE8	U+00E8	è	C3 A8	LATIN SMALL LETTER E WITH GRAVE
0xE9	U+00E9	é	C3 A9	LATIN SMALL LETTER E WITH ACUTE
0xEA	U+00EA	ê	C3 AA	LATIN SMALL LETTER E WITH CIRCUMFLEX
0xEB	U+00EB	ë	C3 AB	LATIN SMALL LETTER E WITH DIAERESIS
0xEC	U+00EC	ì	C3 AC	LATIN SMALL LETTER I WITH GRAVE
0xED	U+00ED	í	C3 AD	LATIN SMALL LETTER I WITH ACUTE
0xEE	U+00EE	î	C3 AE	LATIN SMALL LETTER I WITH CIRCUMFLEX
0xEF	U+00EF	ï	C3 AF	LATIN SMALL LETTER I WITH DIAERESIS
0xF0	U+00F0	ð	C3 B0	LATIN SMALL LETTER ETH
0xF1	U+00F1	ñ	C3 B1	LATIN SMALL LETTER N WITH TILDE
0xF2	U+00F2	ò	C3 B2	LATIN SMALL LETTER O WITH GRAVE
0xF3	U+00F3	ó	C3 B3	LATIN SMALL LETTER O WITH ACUTE
0xF4	U+00F4	ô	C3 B4	LATIN SMALL LETTER O WITH CIRCUMFLEX
0xF5	U+00F5	õ	C3 B5	LATIN SMALL LETTER O WITH TILDE
0xF6	U+00F6	ö	C3 B6	LATIN SMALL LETTER O WITH DIAERESIS
0xF7	U+00F7	÷	C3 B7	DIVISION SIGN
0xF8	U+00F8	ø	C3 B8	LATIN SMALL LETTER O WITH STROKE
0xF9	U+00F9	ù	C3 B9	LATIN SMALL LETTER U WITH GRAVE
0xFA	U+00FA	ú	C3 BA	LATIN SMALL LETTER U WITH ACUTE
0xFB	U+00FB	û	C3 BB	LATIN SMALL LETTER U WITH CIRCUMFLEX
0xFC	U+00FC	ü	C3 BC	LATIN SMALL LETTER U WITH DIAERESIS
0xFD	U+00FD	ý	C3 BD	LATIN SMALL LETTER Y WITH ACUTE
0xFE	U+00FE	þ	C3 BE	LATIN SMALL LETTER THORN
0xFF	U+00FF	ÿ	C3 BF	LATIN SMALL LETTER Y WITH DIAERESIS

I get the exact same output whether I decode the input or not. The statement…

$row->{UnicodeCharacter} = decode_utf8($row->{UnicodeCharacter});

…is a no-op. The string returned by the DBI utility function data_string_desc() is always 'UTF8 on, non-ASCII, 1 characters 2 [or 3] bytes', regardless of whether or not I decode the queried text.

So do you see how I could easily be lulled into thinking I shouldn't have to decode strings of text (not binary data!) queried from a Text datatype column in a Microsoft Office Access database that stores all data in Unicode?


In reply to Re^4: Mugged by UTF8, this CANNOT be right by Jim
in thread Mugged by UTF8, this CANNOT be right by tosh

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.