Re^3: Strange letters ... (clients)

in reply to Re^2: Strange letters ...
in thread Strange letters ...

Assuming it's not a client side bug, it looks like PerlMonks outputs UTF-8 while claiming to output cp1252 in some circumstances.

PerlMonks certainly may, when somebody submits UTF-8 to be displayed. While HTTP has very good ways of clearly saying what encoding is being downloaded, it is much less clear about how a site is supposed to tell the client what encoding it would like things uploaded in, and about how clients tell the server what encoding their uploads are in.

The state of the art on those two points often appears to come down to... guessing. Clients tend to guess that servers want stuff uploaded in the same encoding that the server used to present the form that offered the opportunity to upload. Servers tend to guess that uploads are done in the encoding that they themselves produce, while some also look for byte sequences that seem likely in UTF-8 and then guess that what is being uploaded is UTF-8. There are also special Unicode escapes for URL-encoded data that servers can notice (however, a URL being in "escaped Unicode" doesn't necessarily say anything about any other parts of the upload).
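To make the server-side guess concrete, here is a minimal sketch in Python of the "looks like UTF-8" heuristic. It is only an illustration, not what PerlMonks or any particular server runs; the function name `guess_upload_encoding` and the Windows-1252 fallback are assumptions made for the example.

```python
def guess_upload_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Guess the encoding of uploaded bytes (hypothetical helper)."""
    try:
        raw.decode("ascii")
        # Pure ASCII reads the same under either encoding,
        # so there is nothing to guess: keep the server's default.
        return fallback
    except UnicodeDecodeError:
        pass
    try:
        # A strict decode only succeeds if every multi-byte
        # sequence is well-formed UTF-8.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


utf8_guess = guess_upload_encoding("caf\u00e9".encode("utf-8"))
cp1252_guess = guess_upload_encoding("caf\u00e9".encode("cp1252"))
```

It remains a guess: the two Windows-1252 characters "Ã©" are the bytes C3 A9, which are also well-formed UTF-8, so a heuristic like this one would call them UTF-8.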

And there are cases where this is especially likely to go wrong at PerlMonks. If the client (like most) guesses that Windows-1252 is desired because a page at PerlMonks proclaims itself to be in Windows-1252 (like most of them do, but not all of them, at this point), then the client has to make yet another guess if the data to be submitted contains a character that is not covered by Windows-1252. Most clients, IME, guess that the way to deal with this is to HTML-escape the problem character using an HTML entity. And at PerlMonks, in many cases, that is correct. moritz noted that in the case of text inside of <code> tags, that guess fails (but he was incorrect in just proclaiming that HTML escaping is always what clients choose to use). It also fails for node titles. Some clients instead guess that the server might not be expecting HTML and opt to send the submission in UTF-8 so that they can include the troublesome character (probably guessing that the server will notice the typical pattern of UTF-8 bytes and guess correctly). Some clients guess that perhaps neither route will work and just send '?' for the troublesome character.
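The three client guesses described above can be shown with a character that Windows-1252 cannot represent. This is a small Python sketch using only the standard codecs; the sample text is made up for the example, and no real browser is claimed to work exactly this way.

```python
text = "smile \u263a"  # U+263A (white smiling face) is not in Windows-1252

# Guess 1: stay in Windows-1252 and turn the character into an HTML entity.
as_entity = text.encode("cp1252", errors="xmlcharrefreplace")

# Guess 2: give up on Windows-1252 and send the whole submission as UTF-8.
as_utf8 = text.encode("utf-8")

# Guess 3: stay in Windows-1252 and substitute '?'.
as_question = text.encode("cp1252", errors="replace")

print(as_entity)    # b'smile &#9786;'
print(as_utf8)      # b'smile \xe2\x98\xba'
print(as_question)  # b'smile ?'
```

The first form only works where the server treats the submission as HTML, which is why it fails inside <code> tags and in node titles: there the entity is shown literally rather than as the character.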

I only have a vague recollection of the last time I heard of somebody looking at how the PerlMonks server guesses about encoding of uploads. But that vague memory says that PerlMonks notices Unicode escapes in URLs and doesn't notice UTF-8-like byte sequences and never guesses "UTF-8" about encoding of anything other than URL-encoded data.

It used to be worse when PerlMonks claimed Latin-1 encoding when it was actually just re-sending out whatever bytes people were sending to it. Some Windows users would send bytes that represent characters in Windows-1252 but not in Latin-1. When in a node title, some Unix clients would try to deal with these strictly-speaking "illegal" bytes in interesting ways. Some would actually assume that the byte was really meant to be the Windows-1252 character despite the declared Latin-1 encoding. But then they would refuse to lower themselves to respond in kind and would struggle with what to do when asked to send back that byte. I was particularly amused to see some sending the UTF-8 encoding of the character (which demonstrates a certain kind of "double think" to my eye).
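As an illustration of the Latin-1 versus Windows-1252 mismatch, here is a short Python sketch using byte 0x93, which is a left double quotation mark in Windows-1252 but only a C1 control code in Latin-1. The choice of byte is just an example, not one taken from an actual node.

```python
raw = b"\x93"  # what a Windows client might send for a "smart" opening quote

as_cp1252 = raw.decode("cp1252")   # U+201C LEFT DOUBLE QUOTATION MARK
as_latin1 = raw.decode("latin-1")  # U+0093, an invisible C1 control code

# A client that reads the byte as Windows-1252 but insists on replying
# in the declared Latin-1 has no way to send the character back as-is.
try:
    as_cp1252.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# The "double think" reply: the UTF-8 encoding of the same character.
as_utf8 = as_cp1252.encode("utf-8")

print(repr(as_cp1252), repr(as_latin1), fits_in_latin1, as_utf8)
```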

A much better solution, which I've suggested but which neither I nor anybody else has implemented at PerlMonks, is to include a 'hidden' field in each of our forms whose value always contains a character/byte with the eighth bit set. Then we can quite deterministically tell whether or not the client is uploading in UTF-8.
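A rough sketch of how such a hidden-field check could work, in Python. Since nothing like this has been implemented at PerlMonks, everything here is hypothetical: the sentinel character (é, U+00E9), the function name, and the assumption that the only alternative to UTF-8 is Windows-1252.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sentinel: a single character whose Windows-1252 byte (0xE9)
# has the eighth bit set and whose UTF-8 form is two bytes (0xC3 0xA9).
SENTINEL = "\u00e9"


def upload_encoding_from_sentinel(raw_field: bytes) -> str:
    """Classify an upload by the raw bytes of the hidden sentinel field."""
    if raw_field == SENTINEL.encode("utf-8"):
        return "utf-8"
    if raw_field == SENTINEL.encode("cp1252"):
        return "cp1252"
    # The client mangled the sentinel (for example, sent '?').
    return "unknown"


# The field arrives URL-encoded; decode it to bytes, not to text.
from_utf8_client = upload_encoding_from_sentinel(unquote_to_bytes("%C3%A9"))
from_cp1252_client = upload_encoding_from_sentinel(unquote_to_bytes("%E9"))
```

The point of the design is that the server compares bytes it chose itself, so there is no heuristic involved: the sentinel comes back in whatever encoding the client used for the rest of the form.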

More likely, we'll just convert all of our content to UTF-8, mostly so we can include the interesting characters inside of <code> tags, especially for Perl 6 code (once PerlMonks starts declaring all of our pages as being UTF-8, pretty much every client will always upload to us in UTF-8).

But, actually, that probably has nothing to do with what you have observed.

On rare occasions, I have observed nodes that contain 8-bit characters rendering incorrectly. When this has happened, it is rather random whether a refresh will be rendered correctly or not. I believe such strangeness (based in part on other, similar cases of strangeness) is actually due to bugs in Perl and/or Apache that sometimes eventually manifest when a single process with a single Perl interpreter instance has managed to serve up a few hundred/thousand web pages. We'll get a few children having one of these problems, and a refresh will sometimes hit a confused child and sometimes not. Restarting the web server makes the problem impossible to reproduce again. Waiting quite a while also usually ends with the problem just disappearing again.

- tye


Re^4: Strange letters ... (clients) by Anonymous Monk on Jul 25, 2009 at 09:11 UTC
"-//W3C//DTD HTML 4.0 Transitional//EN"
http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset
`<form ... accept-charset="windows-1252">`
http://www.iana.org/assignments/charset-reg/windows-1252
