Although I'll second the suggestion of using the 'Accept-Charset' header, I'm not so sure about user agents responding in the same encoding as the page.
From RFC 2616 (HTTP/1.1):
3.4 Character Sets

HTTP uses the same definition of the term "character set" as that described for MIME:

The term "character set" is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters. Note that unconditional conversion in the other direction is not required, in that not all characters may be available in a given character set and a character set may provide more than one sequence of octets to represent a particular character. This definition is intended to allow various kinds of character encoding, from simple single-table mappings such as US-ASCII to complex table switching methods such as those that use ISO-2022's techniques. However, the definition associated with a MIME character set name MUST fully specify the mapping to be performed from octets to characters. In particular, use of external profiling information to determine the exact mapping is not permitted.

Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

HTTP character sets are identified by case-insensitive tokens. The complete set of tokens is defined by the IANA Character Set registry [19].

    charset = token

Although HTTP allows an arbitrary token to be used as a charset value, any token that has a predefined value within the IANA Character Set registry [19] MUST represent the character set defined by that registry. Applications SHOULD limit their use of character sets to those defined by the IANA registry.

Implementors should be aware of IETF character set requirements [38] [41].

3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.
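To make the "respect the charset label" rule concrete, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. It is in Python rather than Perl purely for illustration. The function name `charset_from_content_type` is made up for this example, and the ISO-8859-1 fallback is my reading of the default implied by section 3.4.1, not something a real client is guaranteed to do.

```python
from email.message import Message


def charset_from_content_type(header_value, default="ISO-8859-1"):
    """Return the lowercased charset label from a Content-Type value.

    Hypothetical helper: falls back to `default` (an assumption based on
    the ISO-8859-1 default mentioned in RFC 2616) when no charset
    parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes the parameter and lowercases it,
    # which fits the RFC's "case-insensitive tokens" wording.
    charset = msg.get_content_charset()
    return charset if charset is not None else default.lower()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

Lowercasing the label before comparing it is a convenience here, since the RFC only says the tokens are case-insensitive.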
I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):
For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.
(If anyone knows of a followup RFC, I'd love to know what the number is)
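Since the URI itself carries no charset label, about the best a script can do with QUERY_STRING is guess. The sketch below (Python, for illustration only) tries a list of candidate charsets in order and keeps the first one that decodes the percent-escaped octets without error. The function name `decode_query` and the candidate order are assumptions of mine, not anything the RFCs prescribe, and the guess can be wrong: ISO-8859-1 accepts every byte sequence, so it only makes sense as a last resort.

```python
from urllib.parse import parse_qsl


def decode_query(query_string, candidates=("utf-8", "iso-8859-1")):
    """Return (charset, pairs) for the first candidate that decodes cleanly.

    Hypothetical heuristic: strict decoding is attempted with each
    candidate charset in turn; a UnicodeDecodeError moves on to the next.
    """
    for charset in candidates:
        try:
            pairs = parse_qsl(
                query_string,
                keep_blank_values=True,
                encoding=charset,
                errors="strict",
            )
        except UnicodeDecodeError:
            continue
        return charset, pairs
    raise ValueError("no candidate charset decoded the query string")


if __name__ == "__main__":
    print(decode_query("name=caf%C3%A9"))  # octets are valid UTF-8
    print(decode_query("name=caf%E9"))     # lone 0xE9 is not valid UTF-8
```

Trying UTF-8 first is deliberate: text in a legacy single-byte encoding rarely happens to be valid UTF-8, so a clean UTF-8 decode is reasonably strong evidence, while the reverse is not true.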
And for the original poster: although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at "A tutorial on character code issues".
In reply to Re^2: How do I know what encoding was used for form input?
by jhourcle
in thread How do I know what encoding was used for form input?
by ehdonhon