UTF-8 or iso-8859-1 input to CGI.pm

thedi has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: UTF-8 or iso-8859-1 input to CGI.pm by moritz (Cardinal) on Mar 02, 2009 at 09:49 UTC
As far as I know there's no reliable way to determine the encoding that was used. What you can do, however, is pick one encoding, say UTF-8, and apply it consistently to everything:
- Use it as the character encoding of your pages.
- Add a `charset=` attribute to the HTTP Content-Type header.
- List only UTF-8 in the accept-charset attribute of the input forms.
- Use UTF-8 for URL encoding.
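
The consistency advice above might look roughly like this in a script. This is only a minimal sketch using the core Encode module rather than CGI.pm's own helpers; `utf8_header`, `utf8_form`, `decode_field` and the `/cgi-bin/search.pl` action are hypothetical names made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: HTTP header that declares the page encoding.
sub utf8_header {
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n";
}

# Hypothetical helper: a form that lists only UTF-8 in accept-charset.
sub utf8_form {
    my ($action) = @_;
    return qq{<form method="post" action="$action" accept-charset="UTF-8">\n}
         . qq{<input type="text" name="q">\n</form>\n};
}

# Decode a raw form value, trusting the one encoding chosen everywhere.
sub decode_field {
    my ($raw) = @_;
    return decode('UTF-8', $raw);
}

# The page body is encoded to UTF-8 octets on the way out ...
my $page = utf8_header() . encode('UTF-8', utf8_form('/cgi-bin/search.pl'));

# ... and incoming octets are decoded as UTF-8 on the way in.
my $text = decode_field("\xC3\xBC");   # the two octets of U+00FC in UTF-8
```

The point is only that the same encoding name appears at every boundary: header, form, and input decoding.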
Re^2: UTF-8 or iso-8859-1 input to CGI.pm by locked_user sundialsvc4 (Abbot) on Mar 02, 2009 at 12:58 UTC
I'm not readily seeing how the character set used to encode the request is supposed to be an issue. As far as I know, it's UTF-8, with HTML escapes being used to encode all of the special characters that you might need. The browser then informs the server what character sets it will accept, whereupon the server either delivers the information as promised or informs the client that it cannot be done. I do not profess to be a wizard on this one.
Re^3: UTF-8 or iso-8859-1 input to CGI.pm by moritz (Cardinal) on Mar 02, 2009 at 13:14 UTC
The request itself is always ASCII - that's the spec. However, the URL-encoding scheme with `%DE%AD%BE%EF` encodes only bytes, so you need to pick a character encoding. When I used latin-1 for this encoding, some browsers sent me requests with latin-1 encoded data, even though the pages themselves were encoded in UTF-8 (and declared as such). So I guess they decoded the URLs, did some encoding guesswork, and used the result for GET requests. Which is why I recommend consistency ;-)
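
To illustrate why URL encoding alone says nothing about the character encoding, here is a small sketch that percent-encodes the same character under both encodings. `percent_encode` is a hypothetical helper written for this example, not a function from CGI.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every octet outside the unreserved set.
sub percent_encode {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

my $char = "\x{FC}";   # U+00FC, LATIN SMALL LETTER U WITH DIAERESIS

# Same character, two different octet sequences, two different URLs.
my $as_latin1 = percent_encode(encode('ISO-8859-1', $char));
my $as_utf8   = percent_encode(encode('UTF-8',      $char));

print "latin-1: $as_latin1\n";   # %FC
print "UTF-8:   $as_utf8\n";     # %C3%BC
```

The server sees only the `%XX` octets, so it has to know (or guess) which of the two encodings produced them.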
Re: UTF-8 or iso-8859-1 input to CGI.pm by rlucas (Scribe) on Mar 03, 2009 at 01:58 UTC
This is a good question and the answer is non-trivial to comprehend. I recommend taking a look at this page, which is a mirror of a (now-defunct but highly useful) original for background: http://niwo.mnsys.org/saved/~flavell/charset/form-i18n.html

In short, you never really know exactly what you're getting back from a browser, and the very best belt-and-suspenders thing you can do is to have a "magic cookie" hidden form field on each form that contains a smattering of odd characters with an unambiguous representation in each encoding you suspect. That is, you put a bunch of CJK and other stuff, output as HTML entities, into your form field, and you use a heuristic based upon what the browser sends back to determine what encoding the browser has decided to use.

You can also look at Encode::Guess, which makes some informed guesses based on the content, but doesn't require making the round-trip the way the form field technique does.

But what most people do, and what's probably most tractable unless you're trying to solve a problem for a large, diverse user-base, is to figure out what the top 3-4 user-agents you serve do, and just bank on that...
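
For just the two encodings in the question, a content-based guess can be sketched without any round-trip. Note that this is not Encode::Guess itself but a simpler, commonly used check: try a strict UTF-8 decode and fall back to ISO-8859-1 if it fails. `decode_utf8_or_latin1` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (encoding name, decoded text).
sub decode_utf8_or_latin1 {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps decode() from modifying its input.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('UTF-8', $text) if defined $text;

    # Every octet sequence is valid ISO-8859-1, so this cannot fail.
    return ('ISO-8859-1', decode('ISO-8859-1', $octets));
}

my ($enc1, $text1) = decode_utf8_or_latin1("\xC3\xBC");   # valid UTF-8
my ($enc2, $text2) = decode_utf8_or_latin1("\xFC");       # not valid UTF-8

print "$enc1\n";
print "$enc2\n";
```

It is still a guess: pure ASCII input is reported as UTF-8, and a latin-1 string that happens to form valid UTF-8 will be misread, which is why the hidden-field approach is the more thorough one.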
Re: UTF-8 or iso-8859-1 input to CGI.pm by jhourcle (Prior) on Mar 02, 2009 at 17:23 UTC
If you don't trust the browser, I'd try to find a named entity that's represented as a different value in each of the two encodings, and set it as a value in a hidden field. If there aren't any such characters, you could at least use a character that's in Unicode, but not iso-latin, and see what comes back from the browser.
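
A rough sketch of that hidden-field probe, assuming `&eacute;` (U+00E9) as the entity: it is the single octet E9 in ISO-8859-1 and the two octets C3 A9 in UTF-8. The field name `enc_probe` and the helper `classify_probe` are made up for this example, and the helper expects the raw octets after URL-decoding, before any character decoding.

```perl
use strict;
use warnings;

# Hypothetical hidden field emitted in the form; &eacute; is U+00E9.
my $probe_html = '<input type="hidden" name="enc_probe" value="&eacute;">';

# Classify the raw octets the browser sent back for the probe field.
sub classify_probe {
    my ($octets) = @_;
    return 'unknown'    unless defined $octets;
    return 'UTF-8'      if $octets eq "\xC3\xA9";
    return 'ISO-8859-1' if $octets eq "\xE9";   # also E9 in windows-1252
    return 'unknown';
}

print classify_probe("\xC3\xA9"), "\n";
print classify_probe("\xE9"), "\n";
```

Anything else coming back (for instance a numeric character reference) lands in `unknown` and has to be handled by whatever fallback the script chooses.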
Re: UTF-8 or iso-8859-1 input to CGI.pm by wanradt (Scribe) on Mar 03, 2009 at 10:10 UTC
I have no straight answer to your question. From my experience working with UTF-8, I can say that I had no trouble with browser requests: when I sent a page in UTF-8, I got the posts back in UTF-8. But I had lots of trouble with Unicode dataflow in Perl itself. Still, I certainly suggest using UTF-8 (not the old ISO-8859-x). I wrote about using UTF-8 in node 731943; there you can see my problems and some good comments with workarounds. I hope it will give you some additional food for thought. Nõnda, WK
Re: UTF-8 or iso-8859-1 input to CGI.pm by Anonymous Monk on Mar 02, 2009 at 09:32 UTC
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.2
Re^2: UTF-8 or iso-8859-1 input to CGI.pm by thedi (Acolyte) on Mar 02, 2009 at 09:59 UTC
This document describes how a browser can request a document in a desired encoding. That is how a CGI script should encode its response. But I am looking for the encoding of the request: how can a CGI script find out in which encoding a request was sent from the browser to the server? That is, how is the CGI input encoded? This matters when a request contains form input. Is this form input sent UTF-8 encoded or as iso-8859-1? Or rather: how can a CGI.pm based script find out how the input was encoded? Thanks Thedi gerber@id.ethz.ch
Re^3: UTF-8 or iso-8859-1 input to CGI.pm by Anonymous Monk on Mar 02, 2009 at 10:24 UTC
The RFC specifies how both client/server should behave, and all information is in HTTP headers.
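
Whether the headers actually carry the request encoding can be checked from a CGI script via the `CONTENT_TYPE` environment variable. The sketch below looks for a `charset` parameter there; `charset_from_content_type` is a hypothetical helper, and the parameter may well be absent on form submissions, in which case the heuristics from the other replies are still needed.

```perl
use strict;
use warnings;

# Extract the charset parameter from a Content-Type value, if present.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;
    return $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i;
    return;
}

# In a CGI environment the request's Content-Type arrives in CONTENT_TYPE.
my $declared = charset_from_content_type($ENV{CONTENT_TYPE});
print defined $declared ? "declared: $declared\n" : "no charset declared\n";
```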
Re^4: UTF-8 or iso-8859-1 input to CGI.pm by wol (Hermit) on Mar 02, 2009 at 12:34 UTC
Re^5: UTF-8 or iso-8859-1 input to CGI.pm by Anonymous Monk on Mar 03, 2009 at 10:24 UTC