Text Encoding on this site's HTML

The pages on this site are marked as being the Latin-1 character set. Increasingly, though, we are seeing UTF-8 being pasted into code listings.

The <code> blocks are immune to & expansion by design, so you can't just write HTML entities for funny chars.

So... why can't this site do it for us? We could have a <code utf-8> block, a <code Windows> block, etc. The display formatting logic would always turn chars beyond basic ASCII into named entities or numeric Unicode entities, so they display properly regardless of the browser's setting (or it could convert them to match the page's stated charset, for characters that exist in that character set).
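Here is a rough sketch of what that display logic might look like, in Python purely for illustration. The attribute names in `DECLARED_CODECS` and the function name `render_code_block` are made up for this example and are not anything the site actually provides; the idea is just to decode the bytes using the declared encoding and emit everything beyond ASCII as numeric entities.

```python
import html

# Hypothetical mapping from the proposed <code ...> attribute to a codec name.
DECLARED_CODECS = {
    "utf-8": "utf-8",
    "windows": "cp1252",
    "latin-1": "latin-1",
}


def render_code_block(raw: bytes, declared: str = "latin-1") -> str:
    """Return ASCII-only HTML for the contents of a code block.

    Bytes are decoded with the declared encoding (undecodable bytes become
    U+FFFD), markup characters are escaped, and every character beyond
    ASCII is written as a numeric character reference.
    """
    codec = DECLARED_CODECS[declared.lower()]
    text = raw.decode(codec, errors="replace")
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(render_code_block("naïve €".encode("cp1252"), "Windows"))
    print(render_code_block("日本".encode("utf-8"), "utf-8"))
```

Since the output is pure ASCII, it should render the same whatever charset the page itself claims to be.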

A variation would be to have some other attribute in the opening <code> tag to indicate that an escape character is used within the code block, so we could write such characters explicitly if we wanted to.

I think a smart default would work, too. If a code block contains characters that are beyond 127 and are legal UTF-8 encodings, it could assume (by default) that it is in fact UTF-8 and convert them to entities. If that's not correct, it would show in the preview window. Getting it wrong is no worse than the current situation with forgetting to escape out square brackets.
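The smart default could be sketched roughly like this (Python, for illustration only). The function name `guess_code_block_encoding` and the Latin-1 fallback are assumptions for the example, not a description of how the site works.

```python
def guess_code_block_encoding(raw: bytes, fallback: str = "latin-1") -> str:
    """Guess the encoding of pasted code-block bytes.

    Pure ASCII needs no conversion. Anything else is assumed to be UTF-8
    if it decodes cleanly as UTF-8; otherwise the fallback is used.
    """
    if all(byte < 128 for byte in raw):
        return "ascii"
    try:
        raw.decode("utf-8")  # strict decoding: raises on malformed sequences
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


if __name__ == "__main__":
    print(guess_code_block_encoding("café".encode("utf-8")))
    print(guess_code_block_encoding("café".encode("latin-1")))
    # Latin-1 text that happens to be a legal UTF-8 sequence is misdetected:
    print(guess_code_block_encoding("Ã©".encode("latin-1")))
```

The last call is the "getting it wrong" case: the two Latin-1 characters form a legal UTF-8 sequence, so the guess is UTF-8 and the mistake would only be caught in the preview.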

I think changing the sent HTML to UTF-8 is not a solution, since we would continue to have both 8-bit characters and UTF-8 pasted into input fields. The solution is to allow either for input.

Replies are listed 'Best First'.
Re: Text Encoding on this site's HTML by grantm (Parson) on Dec 24, 2002 at 04:27 UTC
I'm not opposed to interim solutions, but we should be working towards using UTF-8 exclusively. The Latin-1 character set is OK for Western languages but no good for Eastern European or Asian languages. Win-Latin-1 (CP1252) is a stupid hack. UTF-8 is inclusive and easy - as long as the tools support it. Perhaps the input forms could offer a menu choice for the input encoding and everything could be converted to UTF-8 on input. Then all output could simply be sent as UTF-8; browsers have supported it for quite some time.
Re: Re: Text Encoding on this site's HTML by theorbtwo (Prior) on Dec 24, 2002 at 06:26 UTC
The correct thing to do is probably to look at the charset the browser declares in the Content-Type header it sends us, and transcode into UTF-8 on the server. Additionally, we should set the accept-charset attribute on our forms to "UTF-8".
Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
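A minimal sketch of the server-side transcoding idea from this reply, in Python for illustration. It assumes the browser actually puts a charset parameter on the Content-Type header of the submission, which many browsers do not; the function names and the Latin-1 default are hypothetical choices for the example.

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset lower-cased, or the default if none is present.
    return msg.get_content_charset(failobj=default)


def transcode_to_utf8(raw: bytes, content_type: str) -> bytes:
    """Re-encode submitted bytes as UTF-8 using the declared charset.

    Falls back to Latin-1 (which accepts any byte) when the declared
    charset is unknown or the bytes do not match it.
    """
    charset = declared_charset(content_type)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


if __name__ == "__main__":
    header = "application/x-www-form-urlencoded; charset=ISO-8859-1"
    print(transcode_to_utf8(b"caf\xe9", header))
```

When no charset is declared, this quietly treats the input as Latin-1, so it would probably need to be combined with a menu choice or some guessing on the form side.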