Re: [not perl] unicode/utf8 in browsers and OS's

I tried a perl one-liner to produce some "typical" non-ASCII characters:

perl -e 'print join(" ",map {chr()} (0xe0 .. 0xe8)),"\n"'

I ran that in a variety of different terminal environments, using different (single-byte) character sets, and pasted each of the outputs into the PM text box to create this post. The first output line comes from an iso-8859-1 xterm (i.e. Latin-1):

à á â ã ä å æ ç è

The next was pasted from a utf8 macosx Terminal window (I had to add the "-CS" option on the perl command line, just for this one run, to avoid the "Wide character in print" warning):

à á â ã ä å æ ç è

Same perl one-liner, in a Terminal using iso-8859-5 (Cyrillic -- no "-CS" option):

р с т у ф х ц ч ш

Here's another version of Cyrillic -- koi8:

Ю А Б Ц Д Е Ф Г Х

Here's 8859-7 (Greek):

ΰ α β γ δ ε ζ η θ

And just to be really perverse, here is 8859-6 (Arabic):

ـ ف ق ك ل م ن ه و

My Safari browser's "View->Text Encoding" is set to "Default" (whatever that means), and I was intrigued by the fact that each string of nine characters showed up exactly as intended, appearing exactly the same as in the original terminal that I copied from. Presumably, Safari and macosx are doing some deep magic here, "doing the right thing" with non-ascii character data in accordance with my current terminal setting, but keeping track of everything "under the covers" as unicode characters (otherwise, the Safari text box would not be able to show all those different characters at the same time).
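For anyone who wants to reproduce those rows without switching terminal settings, here is a minimal sketch that decodes the same nine bytes under each character set with the core Encode module. The helper name render_as is made up for illustration, "koi8-r" is my assumption for which KOI8 variant was meant, and the Greek row uses iso-8859-7, the ISO 8859 part that actually contains these Greek letters.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode(STDOUT, ':encoding(UTF-8)');

# The nine raw byte values printed by the one-liner (0xE0 .. 0xE8).
my $bytes = join '', map { chr } 0xE0 .. 0xE8;

# Hypothetical helper: decode the octets as the given single-byte
# charset and return the resulting characters, space-separated.
sub render_as {
    my ($charset, $octets) = @_;
    return join ' ', split //, decode($charset, $octets);
}

# "koi8-r" is an assumption; the post only says "koi8".
for my $charset (qw(iso-8859-1 iso-8859-5 koi8-r iso-8859-7 iso-8859-6)) {
    printf "%-11s %s\n", $charset, render_as($charset, $bytes);
}
```

On a utf8 terminal this should print one row per character set, each with nine different characters built from the same nine bytes.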

The results from hitting the "preview" button confirm that the characters are being pasted as unicode code points. Curiously, the text box that comes with the preview page shows the non-latin1 strings as numeric character entities (but since I did not put the strings into <code> tags, these entities show up as the intended characters in the main page display). No telling what might happen with Firefox or IE, or whether the behavior of other browsers might depend on your choice of OS. I'll leave that as an exercise... ;)
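I don't know what code actually produces those entities, but the effect described can be imitated with a one-line substitution. This is only a sketch under that assumption: to_entities is a hypothetical name, and the cutoff at U+00FF is my guess based on the latin1 strings being left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+00FF with a
# decimal numeric character reference; ASCII and Latin-1 stay as-is.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

my $cyrillic = "\x{440} \x{441} \x{442}";    # first three letters of the 8859-5 row
print to_entities($cyrillic), "\n";          # &#1088; &#1089; &#1090;
```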

you cannot really tell that a string is a UTF8 string just by looking at it. (You *might* be able to tell that it is *not* one ...)

Actually, when it's a question of recognizing utf8 vs. just about anything else, it's not at all hard to determine with confidence that "it's definitely utf8" or "it's definitely not utf8". Encode::Guess is good for making this distinction, and it would also do quite well (in most cases) on UTF-16 (BE or LE). There are numeric properties of utf8 that are quite distinctive (very unlikely to occur in other types of data), and UTF-16 is a safe bet when you see a regular pattern of null bytes next to 0x0A (line-feed) bytes.
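To make the two checks concrete, here is a rough sketch using plain Encode rather than Encode::Guess. It is a simpler stand-in, not what Encode::Guess does internally, and the sub names are invented for this example. The utf8 test relies on strict decoding; the UTF-16 test is just the "null byte next to every line-feed" heuristic mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the octets are well-formed UTF-8 and contain at least one
# non-ASCII byte (pure ASCII is valid UTF-8 but proves nothing).
sub looks_like_utf8 {
    my ($octets) = @_;
    return 0 unless $octets =~ /[\x80-\xFF]/;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Heuristic only: true if there is at least one 0x0A byte and every
# one of them has a null byte right before or right after it.
sub looks_like_utf16 {
    my ($octets) = @_;
    my $lf_total     = () = $octets =~ /\x0A/g;
    my $lf_with_null = () = $octets =~ /\x00\x0A|\x0A\x00/g;
    return $lf_total > 0 && $lf_with_null == $lf_total ? 1 : 0;
}

print looks_like_utf8("\xC3\xA0"), "\n";           # 1 (utf8 for U+00E0)
print looks_like_utf8("\xE0\xE1\xE2"), "\n";       # 0 (Latin-1 bytes)
print looks_like_utf16("a\x00\x0A\x00"), "\n";     # 1 (UTF-16LE "a\n")
```

The UTF-16 check will be fooled by short inputs or by characters whose code units happen to contain 0x0A, so treat it as a hint, not a verdict.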

If you have data of indeterminate origin that is clearly not UTF-8 or UTF-16, and you don't have any external knowledge to give you clues, then it gets a lot harder to figure out what sort of text data you're dealing with -- it can be done, if you have enough known data for each likely language/encoding combination to build good statistical models (probabilities of byte values or byte ngrams), and enough observable "unknown" data in a given language/encoding for Bayesian arithmetic to be reliable.
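As a toy illustration of the statistical approach, the sketch below trains a byte-unigram model per encoding and picks the model that gives the unknown bytes the highest likelihood, assuming equal priors. Everything here is hypothetical: the sub names, the one-sentence "training corpus", and the choice of koi8-r vs. iso-8859-5 as the only candidates. A usable detector would need far more training data and probably byte ngrams.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Log-probability of each byte value in the sample, with add-one
# smoothing so unseen bytes do not get probability zero.
sub train_model {
    my ($octets) = @_;
    my @count = (1) x 256;
    $count[$_]++ for unpack 'C*', $octets;
    my $total = 0;
    $total += $_ for @count;
    return [ map { log($_ / $total) } @count ];
}

# Sum of per-byte log-probabilities of the octets under one model.
sub log_likelihood {
    my ($model, $octets) = @_;
    my $sum = 0;
    $sum += $model->[$_] for unpack 'C*', $octets;
    return $sum;
}

# Return the label whose model scores the octets highest (equal priors).
sub classify {
    my ($models, $octets) = @_;
    my ($best) = sort {
        log_likelihood($models->{$b}, $octets)
            <=> log_likelihood($models->{$a}, $octets)
    } sort keys %$models;
    return $best;
}

# Toy training data: one short Russian phrase in two encodings.
my $sample = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442} \x{43c}\x{438}\x{440}";
my %models = map { $_ => train_model(encode($_, $sample)) } qw(koi8-r iso-8859-5);

# "Unknown" bytes: the last word of the phrase, encoded as koi8-r.
my $unknown = encode('koi8-r', "\x{43c}\x{438}\x{440}");
print classify(\%models, $unknown), "\n";    # koi8-r
```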


Replies are listed 'Best First'.

Re^2: [not perl] unicode/utf8 in browsers and OS's - where does conversion happen?
by danmcb (Monk) on Jan 06, 2008 at 12:38 UTC

"There are numeric properties of utf8 that are quite distinctive (very unlikely to occur in other types of data) ..."

Well, yes, you can usually say that if something decodes OK as utf8, it probably *is* utf8. But it *will* also be a valid chunk of extended ASCII, or any other charset that makes use of all 256 possibilities for each octet (not elegantly put, but I hope you see my point).

And probably is not the same as *is*. Is it really a problem in practice? I'm not sure. Maybe not. Hey, I'm just asking, OK? I like imagining things all going wrong - it's my job ... ;-)
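The point about "valid chunk of extended ASCII" is easy to see in a few lines. This is a sketch only, with try_decode as a made-up helper: the same utf8 octets decode without complaint as iso-8859-1 (every byte value is legal there), they just come out as twice as many characters. The reverse does not hold, which is why a successful strict utf8 decode is still strong evidence.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict decode, returning undef on failure.
sub try_decode {
    my ($charset, $octets) = @_;
    return eval { decode($charset, $octets, FB_CROAK | LEAVE_SRC) };
}

# utf8 octets for the first three characters of the Latin-1 row.
my $octets = encode('UTF-8', "\x{e0}\x{e1}\x{e2}");

my $as_utf8   = try_decode('UTF-8',      $octets);
my $as_latin1 = try_decode('iso-8859-1', $octets);

printf "as utf8:    %d characters\n", length $as_utf8;      # 3
printf "as latin-1: %d characters\n", length $as_latin1;    # 6

# Going the other way fails: raw Latin-1 bytes are not valid utf8.
print defined try_decode('UTF-8', "\xE0\xE1\xE2") ? "valid\n" : "not utf8\n";
```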
