
Although I'll second the suggestion of using the 'Accept-charset' header, I'm not so sure about user agents responding in the same encoding as the page.

From RFC 2616 (HTTP/1.1):

3.4 Character Sets

   HTTP uses the same definition of the term "character set" as that
   described for MIME:

   The term "character set" is used in this document to refer to a
   method used with one or more tables to convert a sequence of octets
   into a sequence of characters. Note that unconditional conversion in
   the other direction is not required, in that not all characters may
   be available in a given character set and a character set may provide
   more than one sequence of octets to represent a particular character.
   This definition is intended to allow various kinds of character
   encoding, from simple single-table mappings such as US-ASCII to
   complex table switching methods such as those that use ISO-2022's
   techniques. However, the definition associated with a MIME character
   set name MUST fully specify the mapping to be performed from octets
   to characters. In particular, use of external profiling information
   to determine the exact mapping is not permitted.

      Note: This use of the term "character set" is more commonly
      referred to as a "character encoding." However, since HTTP and
      MIME share the same registry, it is important that the terminology
      also be shared.

   HTTP character sets are identified by case-insensitive tokens. The
   complete set of tokens is defined by the IANA Character Set registry
   [19].

       charset = token

   Although HTTP allows an arbitrary token to be used as a charset
   value, any token that has a predefined value within the IANA
   Character Set registry [19] MUST represent the character set defined
   by that registry. Applications SHOULD limit their use of character
   sets to those defined by the IANA registry.

   Implementors should be aware of IETF character set requirements [38]
   [41].

3.4.1 Missing Charset

   Some HTTP/1.0 software has interpreted a Content-Type header without
   charset parameter incorrectly to mean "recipient should guess."
   Senders wishing to defeat this behavior MAY include a charset
   parameter even when the charset is ISO-8859-1 and SHOULD do so when
   it is known that it will not confuse the recipient.

   Unfortunately, some older HTTP/1.0 clients did not deal properly with
   an explicit charset parameter. HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document. See
   section 3.7.1.
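For what it's worth, here is a minimal sketch of how a recipient might follow the quoted rules: take the charset label from the Content-Type field, treat it as a case-insensitive token, and fall back to ISO-8859-1 (the HTTP/1.1 default for text types) instead of guessing when the parameter is missing. It is in Python only for illustration; the function name `charset_from_content_type` and the constant are my own inventions, not anything from the RFC or from a particular library.

```python
import codecs
from email.message import Message

# Assumption: we only care about text/* bodies, for which RFC 2616
# defines ISO-8859-1 as the default when no charset parameter is sent.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(header_value):
    """Return a normalized codec name for a Content-Type header value.

    Returns None when the label is not a charset this Python knows about.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the label; failobj covers the
    # "missing charset" case from section 3.4.1.
    label = msg.get_content_charset(failobj=HTTP_DEFAULT_CHARSET)
    try:
        # codecs.lookup() resolves aliases, so "UTF-8" and "utf8"
        # end up with the same canonical name.
        return codecs.lookup(label).name
    except LookupError:
        return None
```

With this, `text/html; charset=UTF-8` and `text/html; charset=utf-8` resolve to the same codec, and a bare `text/html` resolves to the ISO-8859-1 codec rather than triggering any guessing.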

I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.
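To make the problem concrete: the same character can arrive as different percent-encoded octets depending on the charset the browser used (an "é" is `%C3%A9` in UTF-8 but `%E9` in ISO-8859-1), and nothing in the URI says which one it was. A rough sketch of one common workaround, again in Python for illustration, is to percent-decode to raw octets, try strict UTF-8 first, and fall back to a single-byte charset. This is a heuristic of my own choosing, not something RFC 2396 prescribes, and the helper name `decode_query_value` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw, fallback="iso-8859-1"):
    """Percent-decode one form value, then guess its charset.

    Heuristic only: strict UTF-8 is tried first because random
    ISO-8859-1 text is rarely valid UTF-8; anything that fails is
    decoded with the fallback charset instead.
    """
    # Form encoding uses "+" for spaces; undo that before unquoting.
    octets = unquote_to_bytes(raw.replace("+", " "))
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode(fallback)
```

The guess can still be wrong: a short ISO-8859-1 string whose octets happen to form valid UTF-8 will be misread, which is why having the form declare or echo its encoding is the more reliable route.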

(If anyone knows of a follow-up RFC, I'd love to know what the number is.)

And for the original poster, although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at "A tutorial on character code issues".

In reply to Re^2: How do I know what encoding was used for form input? by jhourcle
in thread How do I know what encoding was used for form input? by ehdonhon
