I tried a perl one-liner to produce some "typical" non-ASCII characters (code points 0xE0 through 0xE8, each a single byte in the character sets used below):

perl -e 'print join(" ",map {chr()} (0xe0 .. 0xe8)),"\n"'

I ran that in a variety of different terminal environments, using different (single-byte) character sets, and pasted each of the outputs into the PM text box to create this post. The first output line comes from an iso-8859-1 xterm (i.e. Latin-1):

à á â ã ä å æ ç è

The next was pasted from a utf8 macosx Terminal window (I had to add the "-CS" option on the perl command line, just for this one run, to avoid the "Wide character in print" warning):

à á â ã ä å æ ç è

Same perl one-liner, in a Terminal using iso-8859-5 (Cyrillic -- no "-CS" option):

р с т у ф х ц ч ш

Here's another version of Cyrillic -- koi8:

Ю А Б Ц Д Е Ф Г Х

Here's 8859-7 (Greek):

ΰ α β γ δ ε ζ η θ

And just to be really perverse, here is 8859-6 (Arabic):

ـ ف ق ك ل م ن ه و

My Safari browser's "View->Text Encoding" is set to "Default" (whatever that means), and I was intrigued by the fact that each string of nine characters showed up exactly as intended, appearing exactly the same as in the original terminal that I copied from. Presumably, Safari and macosx are doing some deep magic here, "doing the right thing" with non-ascii character data in accordance with my current terminal setting, but keeping track of everything "under the covers" as unicode characters (otherwise, the Safari text box would not be able to show all those different characters at the same time).
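To make the bytes-versus-characters point concrete, here is a minimal sketch that decodes the same nine bytes under each of the character sets tried above and prints the resulting Unicode characters. It only assumes the core Encode module; the charset names are Encode's own, "koi8-r" is my assumption for the koi8 variant, and iso-8859-7 is the Greek part of the ISO-8859 family. It shows what any Unicode-aware application has to do in some form, not what Safari or Terminal actually do internally.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode(STDOUT, ':encoding(UTF-8)');

# The same nine bytes that the one-liner emits (without the spaces).
my $bytes = join '', map { chr($_) } 0xE0 .. 0xE8;

# Encode's names for the terminal character sets; "koi8-r" is an assumption.
my @charsets = qw(iso-8859-1 iso-8859-5 koi8-r iso-8859-7 iso-8859-6);

my %decoded;
for my $charset (@charsets) {
    # decode() maps each byte to the Unicode character it stands for in $charset.
    $decoded{$charset} = decode($charset, $bytes);
    printf "%-11s %s\n", $charset, join ' ', split //, $decoded{$charset};
}
```

Once decoded, the five strings no longer collide: each character has its own code point, which is why they can all sit in one text box at the same time.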

The results from hitting the "preview" button confirm that the characters are being pasted as unicode code points. Curiously, the text box that comes with the preview page shows the non-latin1 strings as numeric character entities (but since I did not put the strings into <code> tags, these entities show up as the intended characters in the main page display). No telling what might happen with Firefox or IE, or whether the behavior of other browsers might depend on your choice of OS. I'll leave that as an exercise... ;)
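For anyone unsure what "numeric character entities" look like, here is a rough sketch of the transformation. I have no idea whether the browser or the site does it, or how; to_entities is a hypothetical helper written only for illustration. It leaves Latin-1 characters alone and rewrites anything above U+00FF as a decimal reference.

```perl
use strict;
use warnings;

# Hypothetical helper: keep characters up to U+00FF as they are and turn
# everything above that into a decimal numeric character reference.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\xFF])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Two Latin-1 letters (U+00E0, U+00E1) and two Cyrillic ones (U+0440, U+0441).
my $mixed = "\x{e0} \x{e1} \x{440} \x{441}";

# The Cyrillic letters come out as &#1088; and &#1089;.
print to_entities($mixed), "\n";
```

A browser rendering that output as HTML turns the references back into the intended characters, which matches what shows up in the main page display.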

you cannot really tell that a string is a UTF8 string just by looking at it. (You *might* be able to tell that it is *not* one ...)

Actually, when it's a question of recognizing utf8 vs. just about anything else, it's not at all hard to determine with confidence that "it's definitely utf8" or "it's definitely not utf8". Encode::Guess is good for making this distinction, and it would also do quite well (in most cases) on UTF-16 (BE or LE). There are numeric properties of utf8 that are quite distinctive (very unlikely to occur in other types of data), and UTF-16 is a safe bet when you see a regular pattern of null bytes next to 0x0A (line-feed) bytes.
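Here is a hand-rolled sketch of both checks, just to show how little it takes. This is not what Encode::Guess does internally; looks_like_utf8 and guess_utf16 are hypothetical helpers, and the only library call is Encode's decode() with the FB_CROAK check so that malformed input dies instead of being silently patched.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: true if $bytes contains high-bit bytes and all of
# them form well-formed UTF-8 sequences (pure ASCII proves nothing).
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[\x80-\xFF]/;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: look at aligned 2-byte units and see on which side
# of the line feed the null byte sits. Returns a name or undef.
sub guess_utf16 {
    my ($bytes) = @_;
    my ($be, $le) = (0, 0);
    for my $unit (unpack '(a2)*', $bytes) {
        $be++ if $unit eq "\x00\x0A";
        $le++ if $unit eq "\x0A\x00";
    }
    return undef if $be == $le;
    return $be > $le ? 'UTF-16BE' : 'UTF-16LE';
}

my $utf8_bytes   = encode('UTF-8', "\x{e0}\x{e1}\x{e2}");
my $latin1_bytes = "\xE0 \xE1 \xE2";

print "utf8 sample:   ", looks_like_utf8($utf8_bytes),   "\n";
print "latin1 sample: ", looks_like_utf8($latin1_bytes), "\n";
print "utf16 sample:  ", guess_utf16(encode('UTF-16LE', "a\nb\n")), "\n";
```

The Latin-1 sample fails because 0xE0 announces a three-byte sequence and the next byte is a plain space; that kind of mismatch is what makes utf8 so easy to rule in or out.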

If you have data of indeterminate origin that is clearly not UTF8 or UTF16, and you don't have any external knowledge to give you clues, then it gets a lot harder to figure out what sort of text data you're dealing with -- it can be done, if you have enough known data for each likely language/encoding combination to build good statistical models (probabilities of byte values or byte ngrams), and enough observable "unknown" data in a given language/encoding for Bayesian arithmetic to be reliable.
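As a toy illustration of the statistical approach, here is a sketch that builds single-byte frequency models from known text and picks the model with the highest log-likelihood, assuming equal priors and add-one smoothing. Everything in it (train, log_likelihood, classify, the one-sentence training text) is made up for the example; a usable model would need far more training data and probably byte ngrams rather than single bytes.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Hypothetical helper: count byte frequencies in a known sample.
sub train {
    my ($sample) = @_;
    my %count;
    $count{$_}++ for unpack 'C*', $sample;
    my %model = (count => \%count, total => length($sample));
    return \%model;
}

# Hypothetical helper: log-likelihood of $bytes under one model, with
# add-one smoothing over the 256 possible byte values.
sub log_likelihood {
    my ($model, $bytes) = @_;
    my $sum = 0;
    for my $byte (unpack 'C*', $bytes) {
        my $seen = $model->{count}{$byte} || 0;
        $sum += log(($seen + 1) / ($model->{total} + 256));
    }
    return $sum;
}

# Hypothetical helper: name of the best-scoring model (equal priors assumed).
sub classify {
    my ($models, $bytes) = @_;
    my %score = map { ($_ => log_likelihood($models->{$_}, $bytes)) } keys %$models;
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;
}

# Toy training text, encoded once per candidate charset.
my $training_text = 'это простой пример русского текста для обучения модели';
my %models = map { ($_ => train(encode($_, $training_text))) } qw(koi8-r iso-8859-5);

# "Unknown" data: the same language, in one of the candidate encodings.
my $sample_text = 'русский текст';
print classify(\%models, encode('koi8-r', $sample_text)), "\n";
```

With this little data the result only holds because the two encodings put lowercase Cyrillic in mostly different byte ranges; telling apart encodings that overlap more heavily is where the amount of known and unknown data really starts to matter.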

In reply to Re: [not perl] unicode/utf8 in browsers and OS's - where does conversion happen? by graff
in thread [not perl] unicode/utf8 in browsers and OS's - where does conversion happen? by danmcb
