thedi has asked for the wisdom of the Perl Monks concerning the following question:

I write web applications based on CGI.pm. Up to now we always used iso-8859-1. Now we want to change to utf-8 encoded dialogs.

Everything works as desired, but I feel a bit uncertain about one point:

How can my application check which encoding the browser used to send form input?

I have the impression that browsers will send form input utf-8 encoded when the HTML file containing the form was encoded in utf-8, and iso-8859-1 when the HTML file was encoded as such.

But it would be nice to have more than just 'an impression'.

Is there a way in CGI.pm to find out which encoding was used in the request?

Regards

Thedi gerber@id.ethz.ch

Replies are listed 'Best First'.
Re: UTF-8 or iso-8859-1 input to CGI.pm
by moritz (Cardinal) on Mar 02, 2009 at 09:49 UTC
    As far as I know there's no reliable way to determine the encoding that was used.

    What you can do, however, is pick one encoding, say UTF-8, and apply it consistently to everything:

    • Character encoding of your pages
    • charset= parameter of the HTTP Content-Type header
    • List only UTF-8 in the accept-charset attribute of the input forms
    • Use UTF-8 for URL encoding.
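
A minimal sketch of what that consistency might look like in a CGI script, using only the core Encode module. The helper names (`utf8_content_type`, `utf8_form`, `decode_param`) and the field name `q` are made up for illustration; in a real CGI.pm script the raw bytes would come from `param()`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: header block that declares UTF-8 for the response.
sub utf8_content_type {
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n";
}

# Hypothetical helper: a form that lists only UTF-8 in accept-charset.
sub utf8_form {
    my ($action) = @_;
    return qq{<form method="post" action="$action" accept-charset="UTF-8">\n}
         . qq{<input type="text" name="q">\n</form>\n};
}

# Decode raw parameter bytes strictly as UTF-8. Returns undef on malformed
# input instead of silently producing replacement characters.
sub decode_param {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $text;
}
```

Decoding strictly means that input which is not valid UTF-8 shows up as undef, so the script can reject it or log it rather than store garbage.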

      I'm not readily seeing how the character-set used to encode the request is supposed to be an issue. As far as I know, it's UTF-8, with HTML-escapes being used to encode all of the special-characters that you might need.

      The browser, then, informs the server what character-sets it will accept, whereupon the server either delivers the information as-promised, or informs the client that it cannot be done.

      I do not profess to be a wizard on this one.

        The request itself is always ASCII - that's the spec.

        However the URL-Encoding scheme with %DE%AD%BE%EF encodes only bytes, so you need to pick a character encoding.
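
To illustrate the point about bytes, here is a small sketch that percent-encodes the same character under both encodings. The function name `percent_encode_bytes` is invented for this example; it is not part of CGI.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every byte outside the RFC 3986 unreserved set.
sub percent_encode_bytes {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $out;
}

my $char = "\x{E9}";    # U+00E9, e with acute accent

my $as_latin1 = percent_encode_bytes(encode('ISO-8859-1', $char));
my $as_utf8   = percent_encode_bytes(encode('UTF-8',      $char));

print "ISO-8859-1: $as_latin1\n";    # %E9
print "UTF-8:      $as_utf8\n";      # %C3%A9
```

The same character arrives as one escaped byte or two, and nothing in the escaped form says which encoding produced it.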

        When I used latin-1 for this encoding some browsers sent me some requests with latin-1 encoded data, even though the pages themselves were encoded in UTF-8 (and declared as such). So I guess they decoded the URLs and did some encoding guesswork, and used that for GET requests.

        Which is why I recommend consistency ;-)

Re: UTF-8 or iso-8859-1 input to CGI.pm
by rlucas (Scribe) on Mar 03, 2009 at 01:58 UTC
    This is a good question and the answer is non-trivial to comprehend.

    I recommend taking a look at this page, which is a mirror of a (now-defunct but highly useful) original for background: http://niwo.mnsys.org/saved/~flavell/charset/form-i18n.html

    In short, you never *really* know exactly what you're getting back from a browser, and the very best belt-and-suspenders thing you can do is to have a "magic cookie" hidden form field on each form that contains a smattering of odd characters with an unambiguous representation in each encoding you suspect. That is, you put a bunch of CJK and other stuff, output as HTML entities, into your form field, and you use a heuristic based upon what the browser sends back to determine what encoding the browser has decided to use.
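
A rough sketch of the hidden-field idea, limited to the two encodings from the question. The field name `enc_probe`, the probe characters, and both function names are assumptions made for illustration. Note that this particular probe cannot tell ISO-8859-1 from windows-1252, since both encode these two characters identically.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Probe characters that exist in every candidate encoding but have different
# byte sequences in each: U+00E9 (e acute) and U+00DF (sharp s).
my $PROBE      = "\x{E9}\x{DF}";
my @CANDIDATES = ('UTF-8', 'ISO-8859-1');

# Hidden field for the form. The probe is written as numeric character
# references so the page source itself stays plain ASCII.
sub probe_field {
    my $refs = join '', map { sprintf '&#%d;', ord($_) } split //, $PROBE;
    return qq{<input type="hidden" name="enc_probe" value="$refs">};
}

# Compare the raw bytes the browser sent back against each candidate.
# Returns the encoding name, or undef when nothing matches.
sub detect_encoding {
    my ($received_bytes) = @_;
    for my $name (@CANDIDATES) {
        return $name if $received_bytes eq encode($name, $PROBE);
    }
    return undef;
}
```

The script would call `detect_encoding` on the undecoded value of the probe field and then decode the remaining fields with whatever name comes back.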

    You can also look at Encode::Guess, which makes some informed guesses based on the content, but doesn't require making the round-trip the way the form field technique does.
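
A sketch of one way Encode::Guess could be used here, assuming its default suspects (ASCII and UTF-8) and a fallback to ISO-8859-1 when the guess fails. The function name `decode_guessing` is invented. The fallback is a policy choice of this example, and the guess can be wrong: some ISO-8859-1 byte sequences also happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;    # exports guess_encoding()

# Guess with the default suspects; fall back to ISO-8859-1, in which every
# byte sequence is valid, when the guess does not succeed.
sub decode_guessing {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes);    # encoding object, or an error string
    $enc = Encode::find_encoding('ISO-8859-1') unless ref $enc;
    return ($enc->name, $enc->decode($bytes));
}

my ($name, $text) = decode_guessing("caf\xC3\xA9 au lait");
print "guessed: $name\n";
```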

    But what most people do, and what's probably most tractable unless you're trying to solve a problem for a large, diverse user-base, is to figure out what the top 3-4 user-agents you serve do, and just bank on that...

Re: UTF-8 or iso-8859-1 input to CGI.pm
by jhourcle (Prior) on Mar 02, 2009 at 17:23 UTC

    If you don't trust the browser, I'd try to find a named entity that's represented as a different value in each of the two encodings, and set it as a value in a hidden field.

    If there aren't any such characters, you could at least use a character that's in Unicode but not in ISO-8859-1, and see what comes back from the browser.

Re: UTF-8 or iso-8859-1 input to CGI.pm
by wanradt (Scribe) on Mar 03, 2009 at 10:10 UTC
    I have no straight answer to your question. From my experience working with UTF-8, I can say that I had no trouble with browser requests: when I sent a page in UTF-8, the posts came back in UTF-8. But I had lots of trouble with Unicode data flow in Perl itself. Still, I certainly suggest using UTF-8 (not the old ISO-8859-x). I wrote about using UTF-8 in node 731943, where you can see my problems and some good comments with workarounds. I hope it gives you some additional food for thought.

    Nõnda, WK

Re: UTF-8 or iso-8859-1 input to CGI.pm
by Anonymous Monk on Mar 02, 2009 at 09:32 UTC
      This document describes how a browser can request a document in a desired encoding. That tells a CGI script how to encode its response.

      But I am looking for the encoding of the request. How can a CGI script find out in which encoding a request was sent from the browser to the server? That is: how is the CGI input encoded?

      This matters when a request contains form input. Is this form input sent UTF-8 encoded or ISO-8859-1 encoded?

      Or rather: how can a CGI.pm-based script find out how the input was encoded?

      Thanks

      Thedi gerber@id.ethz.ch

        The RFC specifies how both client/server should behave, and all information is in HTTP headers.
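
For what it's worth, a sketch of checking the one request header that can carry this information. Under CGI the request's Content-Type is passed in the `CONTENT_TYPE` environment variable; the function name `request_charset` is invented here. As far as I know, browsers usually omit the charset parameter on form submissions, so expect this to come back empty most of the time.

```perl
use strict;
use warnings;

# Extract the charset parameter from a Content-Type value, if present.
# Returns undef when the header is missing or declares no charset.
sub request_charset {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i ? $1 : undef;
}

# Under CGI the request's Content-Type arrives in the environment.
my $charset = request_charset($ENV{CONTENT_TYPE});
print defined $charset ? "declared: $charset\n" : "no charset declared\n";
```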