Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Although I'll second the suggestion of using the 'Accept-charset' header, I'm not so sure about user agents responding in the same encoding as the page

From RFC 2616 (HTTP/1.0):

3.4 Character Sets HTTP uses the same definition of the term "character set" as that described for MIME: The term "character set" is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters. Note that unconditional conversion i +n the other direction is not required, in that not all characters may be available in a given character set and a character set may provi +de more than one sequence of octets to represent a particular characte +r. This definition is intended to allow various kinds of character encoding, from simple single-table mappings such as US-ASCII to complex table switching methods such as those that use ISO-2022's techniques. However, the definition associated with a MIME characte +r set name MUST fully specify the mapping to be performed from octets to characters. In particular, use of external profiling information to determine the exact mapping is not permitted. Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminolo +gy also be shared. HTTP character sets are identified by case-insensitive tokens. The complete set of tokens is defined by the IANA Character Set registr +y [19]. charset = token Although HTTP allows an arbitrary token to be used as a charset value, any token that has a predefined value within the IANA Character Set registry [19] MUST represent the character set define +d by that registry. Applications SHOULD limit their use of character sets to those defined by the IANA registry. Implementors should be aware of IETF character set requirements [38 +] [41]. 3.4.1 Missing Charset Some HTTP/1.0 software has interpreted a Content-Type header withou +t charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient. Unfortunately, some older HTTP/1.0 clients did not deal properly wi +th an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that ha +ve a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.

I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, i +f there might be more than one [RFC2277]. However, there is currentl +y no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

(If anyone knows of a followup RFC, I'd love to know what the number is)

And for the original poster, although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at A tutorial on character code issues


In reply to Re^2: How do I know what encoding was used for form input? by jhourcle
in thread How do I know what encoding was used for form input? by ehdonhon

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (3)
As of 2024-04-20 03:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found