Re^2: Character encoding of microns

It's a very good point that when you're trying to work out character encoding problems, you need to know what your display method is doing as well as what your program is doing. That's why hex dumps of output are so useful (sad, but true).
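For what it's worth, here is a minimal sketch of the kind of hex dump I mean. The helper name `hex_bytes` is just something made up for this example (it is not from any module), and it assumes you hand it a byte string rather than a decoded character string:

```perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a byte string as two hex
# digits, colon-separated, so you can see exactly what was written.
sub hex_bytes {
    my ($bytes) = @_;
    return join ':', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The two bytes C2 B5 are the UTF-8 encoding of U+00B5 MICRO SIGN.
print hex_bytes("10 \xC2\xB5m"), "\n";   # 31:30:20:C2:B5:6D
```

Looking at the bytes this way takes the terminal or editor out of the picture, which is usually the first thing you want to rule out.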

But it's also worthwhile to understand the "?" output a little better. When any Unicode-aware process (whether a Perl script, display terminal, browser rendering engine, database client, database server, or whatever) converts from Unicode to some other encoding, the usual default behavior is to replace a Unicode character with "?" whenever the output encoding has no character that maps to the given Unicode code point.
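A small sketch of that substitution using Perl's Encode module with its default CHECK behavior. The variable names are mine, and other tools (database clients, terminals) may pick a different substitute character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $micro = "\x{B5}";    # U+00B5 MICRO SIGN, present in ISO-8859-1
my $shcha = "\x{449}";   # U+0449 CYRILLIC SMALL LETTER SHCHA, not present

my $micro_bytes = encode('iso-8859-1', $micro);
my $shcha_bytes = encode('iso-8859-1', $shcha);

# Show the resulting bytes in hex: the micro sign survives as B5,
# the Cyrillic letter is replaced by 3F, which is "?".
print unpack('H*', $micro_bytes), " ", unpack('H*', $shcha_bytes), "\n";   # b5 3f
```

So a "?" in the output tells you the data was still good Unicode right up to the point where something encoded it into a character set that could not hold it.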

When you see "?" in your output where you expect to see other characters, the first thing to do is to identify the point in the processing or display where Unicode data has been converted to some other encoding.

When data is going the other direction (from some known or assumed "other" encoding into Unicode), and the conversion process (wherever it is) sees input bytes or byte sequences that are not defined in the mapping table for the given non-Unicode character set, it will put one or more "\x{fffd}" (the Unicode "replacement character") in place of the uninterpretable parts of its output Unicode string.
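Here is a rough illustration of the decoding direction, again with Encode's default CHECK behavior. I'm using UTF-8 as the assumed input encoding because a lone B5 byte is easy to show as invalid there; the sample string is made up:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte B5 is the micro sign in ISO-8859-1, but a lone B5 is not valid UTF-8.
my $bytes = "10 \xB5m";

my $as_latin1 = decode('iso-8859-1', $bytes);   # B5 maps to U+00B5
my $as_utf8   = decode('UTF-8',      $bytes);   # B5 cannot be interpreted

printf "latin1: U+%04X\n", ord substr($as_latin1, 3, 1);   # latin1: U+00B5
printf "utf8:   U+%04X\n", ord substr($as_utf8,   3, 1);   # utf8:   U+FFFD
```

If you find U+FFFD in your strings, the bytes were decoded with the wrong assumption about their encoding, which is a different problem from the "?" case.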

Re^3: Character encoding of microns by joec_ (Scribe) on Feb 09, 2009 at 10:54 UTC
Hi,

Am I correct in assuming that the Oracle encoding WE8ISO8859P1 is actually ISO-8859-1? In that case, am I also correct in assuming that perl automatically writes data as ISO-8859-1? Even if I decode('ISO-8859-1', $clob); I still get question marks written for microns.

I just tried a little experiment: in Notepad++ I wrote a single micro sign (Alt-0181). That displayed fine when the encoding was ANSI. When I changed it to utf-8, I got a box/splodge. When I open my actual file and change the encoding from ANSI to utf-8, nothing happens. This is interesting, is it not?

This problem is beginning to bug me now :). Any help appreciated.

Joe

UPDATE---

`clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte`
`conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8`
`unix perlio`
`clob: 'this is string with Âµ in it'`
`conv: 'this is string with µ in it'`
`unix perlio encoding(utf8) utf8`
`clob: 'this is string with ÃÂµ in it'`
`conv: 'this is string with Âµ in it'`

That is the output of oshalla's code. It would seem that decoding as utf8 first makes it work, as long as you don't binmode STDOUT. After binmode the strange As start to appear. This is fine for this test string, but my database output still has question marks in place of the micro signs.

Update 2: I wrote a little C# program to grab the output from Oracle and write it to a file. This had no problem and worked fine. In Perl, binmode on STDOUT didn't affect anything and neither did `use encoding 'utf8'`.

Any help appreciated, guys.

-- joe
--- Eschew obfuscation, espouse eludication!
Re^4: Character encoding of microns by ikegami (Patriarch) on Feb 10, 2009 at 15:37 UTC
"am I also correct in assuming that perl automatically writes data as ISO-8859-1?"

Not really. Perl outputs using whatever encoding you specify (via `use open`, `binmode` or some other means). If you don't specify, it outputs the internal representation of the string, which is either arbitrary bytes of unknown encoding (UTF8 flag off) or a lax variant of UTF-8 called utf8 (UTF8 flag on). If the UTF8 flag is on, you might also get a warning.

If you happen to pass iso-latin-1 characters to Perl and you print these out, Perl will output iso-latin-1. But the same goes for any encoding.

`# U+00E9 LATIN SMALL LETTER E WITH ACUTE`
`# Second perl outputs iso-8859-1`
`$ perl -e'use open ":std", ":encoding(iso-8859-1)"; print chr(0x00E9)' | perl -e"print <>" | od -t x1`
`0000000 e9`
`0000001`

`# U+0449 CYRILLIC SMALL LETTER SHCHA`
`# Second perl outputs iso-8859-5`
`$ perl -e'use open ":std", ":encoding(iso-8859-5)"; print chr(0x0449)' | perl -e"print <>" | od -t x1`
`0000000 e9`
`0000001`

However, many aspects of Perl will presume the arbitrary bytes of unknown encoding are iso-latin-1. This includes `uc`, regexp character classes such as `\w`, explicit upgrades to utf8 (`utf8::upgrade($_)`), and implicit upgrades to utf8 (`chop( $_ . chr(0x2660) )`).
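To make the upgrade case concrete, here is a minimal sketch (variable names are just for illustration) showing that an explicit upgrade treats the byte E9 as the iso-latin-1 character U+00E9:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "\xE9";                               # one byte, UTF8 flag off
my $before = utf8::is_utf8($s) ? 1 : 0;

utf8::upgrade($s);                            # internal storage becomes utf8
my $after = utf8::is_utf8($s) ? 1 : 0;

# The string still compares equal: E9 was taken to mean U+00E9.
my $same = $s eq "\xE9" ? 1 : 0;

# Encoding explicitly as UTF-8 gives the two bytes C3 A9.
my $utf8_hex = unpack 'H*', encode('UTF-8', $s);

print "flag before=$before after=$after same=$same utf8=$utf8_hex\n";
# flag before=0 after=1 same=1 utf8=c3a9
```

If those bytes were really iso-8859-5 (as in the second one-liner above), the upgrade would still read E9 as U+00E9, which is why it pays to decode input explicitly.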