Although I'll second the suggestion of using the 'Accept-charset' header, I'm not so sure about user agents responding in the same encoding as the page

From RFC 2616 (HTTP/1.0):

3.4 Character Sets HTTP uses the same definition of the term "character set" as that described for MIME: The term "character set" is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters. Note that unconditional conversion i +n the other direction is not required, in that not all characters may be available in a given character set and a character set may provi +de more than one sequence of octets to represent a particular characte +r. This definition is intended to allow various kinds of character encoding, from simple single-table mappings such as US-ASCII to complex table switching methods such as those that use ISO-2022's techniques. However, the definition associated with a MIME characte +r set name MUST fully specify the mapping to be performed from octets to characters. In particular, use of external profiling information to determine the exact mapping is not permitted. Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminolo +gy also be shared. HTTP character sets are identified by case-insensitive tokens. The complete set of tokens is defined by the IANA Character Set registr +y [19]. charset = token Although HTTP allows an arbitrary token to be used as a charset value, any token that has a predefined value within the IANA Character Set registry [19] MUST represent the character set define +d by that registry. Applications SHOULD limit their use of character sets to those defined by the IANA registry. Implementors should be aware of IETF character set requirements [38 +] [41]. 3.4.1 Missing Charset Some HTTP/1.0 software has interpreted a Content-Type header withou +t charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient. Unfortunately, some older HTTP/1.0 clients did not deal properly wi +th an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that ha +ve a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.

I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, i +f there might be more than one [RFC2277]. However, there is currentl +y no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

(If anyone knows of a followup RFC, I'd love to know what the number is)

And for the original poster, although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at A tutorial on character code issues


In reply to Re^2: How do I know what encoding was used for form input? by jhourcle
in thread How do I know what encoding was used for form input? by ehdonhon

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.