Re^2: How do I know what encoding was used for form input?

by jhourcle (Prior)
on Aug 11, 2005 at 13:11 UTC


in reply to Re: How do I know what encoding was used for form input?
in thread How do I know what encoding was used for form input?

Although I'll second the suggestion of using the 'Accept-charset' header, I'm not so sure that user agents respond in the same encoding as the page.

From RFC 2616 (HTTP/1.1):

3.4 Character Sets

HTTP uses the same definition of the term "character set" as that described for MIME:

The term "character set" is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters. Note that unconditional conversion in the other direction is not required, in that not all characters may be available in a given character set and a character set may provide more than one sequence of octets to represent a particular character. This definition is intended to allow various kinds of character encoding, from simple single-table mappings such as US-ASCII to complex table switching methods such as those that use ISO-2022's techniques. However, the definition associated with a MIME character set name MUST fully specify the mapping to be performed from octets to characters. In particular, use of external profiling information to determine the exact mapping is not permitted.

Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.

HTTP character sets are identified by case-insensitive tokens. The complete set of tokens is defined by the IANA Character Set registry [19].

    charset = token

Although HTTP allows an arbitrary token to be used as a charset value, any token that has a predefined value within the IANA Character Set registry [19] MUST represent the character set defined by that registry. Applications SHOULD limit their use of character sets to those defined by the IANA registry.

Implementors should be aware of IETF character set requirements [38] [41].

3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.
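To make the "respect the charset label" rule concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration) of pulling the charset parameter out of a Content-Type value. The function name is hypothetical, and the ISO-8859-1 fallback is my assumption based on the text/* default in section 3.7.1 of the same RFC; it is not something the quoted passage mandates for form submissions.

```python
from email.message import Message


def charset_from_content_type(value, default="iso-8859-1"):
    """Return the lowercased charset parameter of a Content-Type value.

    Falls back to `default` when no charset parameter is present
    (assumed default: ISO-8859-1, per RFC 2616 section 3.7.1).
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when the parameter is missing.
    return msg.get_content_charset() or default
```

In practice few browsers of this era attach a charset to the Content-Type of a form POST, so a missing label is the common case and the fallback matters more than the parsing.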

I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

(If anyone knows of a followup RFC, I'd love to know what the number is)
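To illustrate why the QUERY_STRING case is ambiguous, here is a small sketch (Python, for illustration only; the helper names are hypothetical) showing that the same character is percent-encoded differently depending on the charset the browser picked, and that nothing in the URI itself says which one was used.

```python
from urllib.parse import quote, unquote


def encodings_of(text, charsets=("iso-8859-1", "utf-8")):
    """Percent-encode `text` once per charset to show how the octets differ."""
    return {cs: quote(text, encoding=cs) for cs in charsets}


def decodes_as(escaped, charset):
    """Decode a percent-encoded string strictly; return None if the bytes are invalid."""
    try:
        return unquote(escaped, encoding=charset, errors="strict")
    except UnicodeDecodeError:
        return None
```

For example, `encodings_of("é")` gives `%E9` for ISO-8859-1 and `%C3%A9` for UTF-8. Decoding `%E9` as UTF-8 fails outright, while decoding `%C3%A9` as ISO-8859-1 silently succeeds and yields two wrong characters, which is the harder failure to notice.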

And for the original poster, although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at A tutorial on character code issues.

Re^3: How do I know what encoding was used for form input?
by itub (Priest) on Aug 11, 2005 at 14:57 UTC
    That's right, there's no real standard way of telling a client how the URI for a GET should be encoded (and even if there is for POST, it seems most clients don't comply). However, practical experience with mainstream browsers leads to this conclusion (quoted from http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html):
    By now (2005) the robust way to deal with this issue is to send out forms pages encoded in utf-8, expecting the forms input to be submitted back using that encoding. This has been in practical use for a couple of years now (e.g at Google) and can be expected to work with any current HTML4-compatible browser. However, there are other browsers still in use which don't fit this description, so it still seems relevant to look at the theory and compare it with observations.

    I've used this approach for several websites and it works with all the (reasonably recent) browsers I've tested.

    "In theory, theory and practice are the same, but in practice, they never are."

      Thanks for the reference -- I know sgifford had given it as well, but he seemed to just be quoting it, rather than mentioning the information it contained.

      I hadn't seen the 'buzzword' concept presented before, but it seems like a simple hack to validate what's being sent back to you.
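A rough sketch of how such a check might look on the server side, in Python for illustration only. It assumes the form carries a hidden field with a known non-ASCII value; the field name `enc_check`, the sentinel string, and the candidate list are all hypothetical choices of mine, not something taken from the referenced page.

```python
from urllib.parse import parse_qsl

SENTINEL_FIELD = "enc_check"       # hypothetical hidden form field
SENTINEL_VALUE = "\u00e9\u00df"    # "éß": representable in both candidates, with different bytes
CANDIDATES = ("utf-8", "iso-8859-1")


def detect_form_encoding(query_string):
    """Return (encoding, fields) for the first candidate under which the
    hidden sentinel field decodes to its expected value, else (None, {})."""
    for enc in CANDIDATES:
        try:
            fields = dict(parse_qsl(query_string, keep_blank_values=True,
                                    encoding=enc, errors="strict"))
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if fields.get(SENTINEL_FIELD) == SENTINEL_VALUE:
            return enc, fields
    return None, {}
```

Comparing against the sentinel matters because ISO-8859-1 decoding never fails on its own: any byte sequence is valid Latin-1, so a successful decode alone proves nothing.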
