http://qs1969.pair.com?node_id=482738

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks,

I'm on a quest to try to grok the entire concept of Unicode and encodings. I think I'm starting to get things down, but I've got a question as to how this all works with CGI.pm.

I just read this article from Joel on Software: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". One of the key points seems to be that:

"It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."

So my question is, whenever somebody enters some UTF-8 text into one of my form fields and clicks submit, I know I can call $cgi->param( 'field1' ) and get back the data that was submitted, but how do I find out that the data is encoded in UTF-8? If Joel is correct, then it would seem the data is pretty useless without that information.

Thanks.

Re: How do I know what encoding was used for form input?
by sgifford (Prior) on Aug 10, 2005 at 22:01 UTC
Re: How do I know what encoding was used for form input?
by itub (Priest) on Aug 10, 2005 at 20:15 UTC
    You can never be 100% sure, but well-behaved user agents usually submit the data in the encoding that was used in the page containing the form. You can also specify which charsets you accept by using the "accept-charset" attribute of the form element. Some user agents might also specify the charset in the Content-Type header for POST requests.
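A minimal sketch of the fallback order described above, assuming the form page was served as UTF-8. The helper names charset_from_content_type and decode_field are hypothetical, not part of CGI.pm; only Encode's decode and find_encoding are real library calls. It trusts a charset parameter in the request's Content-Type header when one is present and Encode recognises it, and otherwise falls back to the page's own encoding.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pick the charset named in a Content-Type header value,
# falling back to the encoding the form page was served in.
sub charset_from_content_type {
    my ($content_type, $page_charset) = @_;
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([\w.:-]+)"?/i) {
        # Only trust names that Encode actually recognises.
        return $1 if find_encoding($1);
    }
    return $page_charset;
}

# Hypothetical helper: turn the raw bytes of a submitted field
# into a Perl character string.
sub decode_field {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

# Intended use in a CGI script (assumes the page itself was sent as UTF-8):
#   my $charset = charset_from_content_type($ENV{CONTENT_TYPE}, 'UTF-8');
#   my $text    = decode_field(scalar $cgi->param('field1'), $charset);
```

Since many browsers never send the charset parameter, the fallback branch is the one that will usually run.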

      Although I'll second the suggestion of using the "accept-charset" attribute, I'm not so sure about user agents responding in the same encoding as the page.

      From RFC 2616 (HTTP/1.1):

      I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):

      (If anyone knows of a followup RFC, I'd love to know what the number is)

      And for the original poster, although Joel's article is a good start, it's intended as a quick overview -- I'd also suggest you take a look at A tutorial on character code issues

        That's right, there's no real standard way of telling a client how the URI for a GET should be encoded (and even if there is for POST, it seems most clients don't comply). However, practical experience with mainstream browsers leads to this conclusion (http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html):
        By now (2005) the robust way to deal with this issue is to send out forms pages encoded in utf-8, expecting the forms input to be submitted back using that encoding. This has been in practical use for a couple of years now (e.g. at Google) and can be expected to work with any current HTML4-compatible browser. However, there are other browsers still in use which don't fit this description, so it still seems relevant to look at the theory and compare it with observations.

        I've used this approach for several websites and it works with all the (reasonably recent) browsers I've tested.
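For the original poster, a rough sketch of what this approach can look like on the receiving side. The helper decode_utf8_param is a hypothetical name, not something CGI.pm provides; it assumes the form page went out as UTF-8 and treats anything that is not valid UTF-8 as a sign that the browser did not play along.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a raw form value that is expected to be UTF-8.
# Returns the character string, or undef if the bytes are not valid UTF-8
# (for example, because an old browser submitted in another encoding).
sub decode_utf8_param {
    my ($raw) = @_;
    return unless defined $raw;
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    my $text = eval { decode('UTF-8', $raw, Encode::FB_CROAK()) };
    return $text;
}

# Intended use with CGI.pm (field name 'field1' is taken from the question):
#   print $cgi->header(-type => 'text/html', -charset => 'UTF-8');
#   my $text = decode_utf8_param(scalar $cgi->param('field1'));
#   # $text is undef if the submission was not valid UTF-8
```

Strict decoding is a design choice here: it lets the script notice a mismatched submission instead of silently storing replacement characters.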

        "In theory, theory and practice are the same, but in practice, they never are."

Re: How do I know what encoding was used for form input?
by kutsu (Priest) on Aug 10, 2005 at 22:39 UTC

    Not directly having to do with your question, but a good discussion of that article exists in Programmers, script languages, and Unicode.

    "Cogito cogito ergo cogito sum - I think that I think, therefore I think that I am." Ambrose Bierce

Re: How do I know what encoding was used for form input?
by InfiniteLoop (Hermit) on Aug 10, 2005 at 20:57 UTC
    One thing you must keep in mind is that non-ASCII characters take more than one byte in UTF-8. Hence any byte-oriented string operation (such as slicing with substr) can corrupt the data unless you decode it into characters first. For more info read:
    1. perluniintro
    2. perlunicode
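A small illustration of the point above, using Encode's decode and encode. The sample string is made up for the example: "é" followed by "a", written as the UTF-8 bytes a browser would submit.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "é" followed by "a", as raw UTF-8 bytes.
my $bytes = "\xC3\xA9a";

# Undecoded, Perl sees three bytes, and substr can cut "é" in half.
my $byte_len   = length $bytes;
my $first_byte = substr $bytes, 0, 1;    # "\xC3", an incomplete sequence

# Decoded, Perl sees two characters, and substr works per character.
my $chars      = decode('UTF-8', $bytes);
my $char_len   = length $chars;
my $first_char = substr $chars, 0, 1;    # "é" (U+00E9)

# Encode again before writing the result out as UTF-8.
my $out = encode('UTF-8', $first_char);  # "\xC3\xA9"
```

The pattern is decode on input, work on characters, encode on output.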
Re: How do I know what encoding was used for form input?
by pg (Canon) on Aug 11, 2005 at 00:17 UTC

    It might become possible for your program to find out the encoding, if:

    1. Eventually we have some breakthrough and computers start to understand natural language (so the program tries different encodings and picks the one that produces a meaningful result).
    2. We assume the content is actually meaningful.