Confusing UTF-8 bug in CGI-script

wanradt has asked for the wisdom of the Perl Monks concerning the following question:

Re: Confusing UTF-8 bug in CGI-script by Corion (Patriarch) on Feb 01, 2011 at 15:32 UTC
`use locale;`

The ultimate secret to locales: "Avoid". My guess is that your "web user" is not running under whatever "locale" you assume. I would specify the script encoding explicitly using encoding: `use encoding "greek"; # or whatever`
Re^2: Confusing UTF-8 bug in CGI-script by wanradt (Scribe) on Feb 01, 2011 at 15:50 UTC
Thank you! Locale is evil; I once even wrote a node about it (any use of 'use locale'?), but I don't blame it here. At least, commenting it out has no effect. Nõnda, WK
Re^3: Confusing UTF-8 bug in CGI-script by Corion (Patriarch) on Feb 01, 2011 at 15:52 UTC
So, are you sure that your script is encoded as UTF-8? Because that's what you tell Perl: `use utf8;`

If it is not encoded as that, that's likely what leads to these decoding errors and/or the crashes.
Re^4: Confusing UTF-8 bug in CGI-script by wanradt (Scribe) on Feb 01, 2011 at 16:10 UTC
Re: Confusing UTF-8 bug in CGI-script by Anonyrnous Monk (Hermit) on Feb 01, 2011 at 16:13 UTC
Not sure if it helps, but your script works fine for me. I.e., when I drop it into the `cgi-bin` directory of an Apache server, view the page in Firefox, and enter some non-ASCII content into the textarea input field (e.g. by cut-n-pasting something from a Chinese web page), the data is echoed back from the script without an encoding problem. And if I trace the traffic between browser and web server, it's encoded as UTF-8, as expected.

The script also works when I comment out the lines `use locale;` and `use open ':std' => ':encoding(UTF-8)';` (which are superfluous at best, IMHO). And the `use utf8;` line is of course only required if the script source itself is in fact encoded in UTF-8 (the literal characters "öäüõžš¢ð€¶", in this case).

(Tested with various versions of CGI.pm from 3.04 to 3.49. Update: 3.04, 3.15, 3.29, 3.48 and 3.49, to be precise.)

Correction: with CGI-3.49/Perl-5.12.2, STDOUT needs to be explicitly declared as UTF-8 (either with `binmode STDOUT, ":utf8"`, or with `use open...`), otherwise I'm getting warnings "Wide character in print" in the error log. This is not the case with earlier versions.
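To see what declaring an output handle as UTF-8 changes, here is a small sketch. It does not involve CGI.pm at all; it writes one character through an `:encoding(UTF-8)` layer into an in-memory buffer and looks at the bytes that come out. The names `$buffer` and `$out` are made up for the illustration, and the in-memory handle only stands in for STDOUT.

```perl
use strict;
use warnings;

my $text   = "\x{20AC}";   # one character: the euro sign
my $buffer = '';           # receives the bytes that would go to the browser

# Open an in-memory handle with the same layer one would put on STDOUT.
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# Show the encoded bytes as hex pairs.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "bytes written: $hex\n";   # bytes written: E2 82 AC
```

Without such a layer Perl has no declared encoding for the handle, which is the situation in which the "Wide character in print" warning mentioned above shows up.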
Re^2: Confusing UTF-8 bug in CGI-script by ikegami (Patriarch) on Feb 01, 2011 at 18:40 UTC
"The script also works when I comment out the lines `use locale;` and `use open ':std' => ':encoding(UTF-8)';` (which are superfluous at best, IMHO)."

«`use locale;`» is indeed superfluous since he doesn't do any operations that use locales (`cmp`, `lc`, etc). It's not relevant to the OP's question since it doesn't affect encoding.

«`use open ':std' => ':encoding(UTF-8)';`» is not superfluous. Part of what it does is necessary, and the other part of what it does is wrong. Specifically,

```perl
BEGIN {
    # Wrong, and the cause of the OP's problem. See my reply to the OP.
    binmode(STDIN, ':encoding(UTF-8)');

    # Necessary to encode the returned HTML.
    binmode(STDOUT, ':encoding(UTF-8)');

    # Necessary to encode error messages for the log.
    binmode(STDERR, ':encoding(UTF-8)');
}
```

It could be replaced with the following or something equivalent, but it shouldn't be eliminated.

```perl
BEGIN {
    binmode(STDIN);                       # Form data
    binmode(STDOUT, ':encoding(UTF-8)');  # HTML
    binmode(STDERR, ':encoding(UTF-8)');  # Error messages
}
```
Re^3: Confusing UTF-8 bug in CGI-script by Anonyrnous Monk (Hermit) on Feb 01, 2011 at 18:55 UTC
```perl
# Wrong, and the cause of the OP's problem. See my reply to the OP.
binmode(STDIN, ':encoding(UTF-8)');
```

That's what I would've thought, too, but interestingly, it doesn't do any harm in practice (I did try it), and

```perl
# Necessary to encode the returned HTML.
binmode(STDOUT, ':encoding(UTF-8)');
```

only seems to be required with newer versions of CGI.pm (as I mentioned). Older versions apparently did the encoding themselves before printing to STDOUT (?)
Re^4: Confusing UTF-8 bug in CGI-script by ikegami (Patriarch) on Feb 01, 2011 at 19:20 UTC
Re^5: Confusing UTF-8 bug in CGI-script by Anonyrnous Monk (Hermit) on Feb 01, 2011 at 19:25 UTC
Re: Confusing UTF-8 bug in CGI-script by ikegami (Patriarch) on Feb 01, 2011 at 18:25 UTC
The problem is that you are decoding STDIN (via `use open`), and STDIN is used to transfer something that isn't text. The solution is to add `binmode(STDIN);`

The problem in detail

CGI (the protocol) uses STDIN to pass on the document sent in POST requests. In this case, it's an `application/x-www-form-urlencoded` document. Then, CGI (the module) parses that document and decodes the extracted text components. Encode (used by CGI) detects (correctly guesses) that something's wrong and fails with `Cannot decode string with wide characters`

Don't decode non-text. Other processing (decoding of "%" escapes) needs to be done before you have the text that needs to be decoded.
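The error itself can be reproduced in isolation. The following is only a sketch of the general failure mode (decoding something that has already been decoded), not a trace of what CGI.pm does internally; the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE2\x82\xAC";            # UTF-8 bytes of the euro sign
my $chars = decode('UTF-8', $bytes);   # first decode: one character, U+20AC

# Decoding again treats characters as if they were bytes. U+20AC does not
# fit into a byte, so Encode gives up instead of guessing.
my $again = eval { decode('UTF-8', $chars) };
my $error = $@;

printf "first decode gave %d character(s)\n", length $chars;
print  "second decode failed: $error" if $error;
```

The second call is the position the module ends up in when a layer on STDIN has already turned the incoming data into characters.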
Re^2: Confusing UTF-8 bug in CGI-script by wanradt (Scribe) on Feb 01, 2011 at 20:14 UTC
"STDIN is used to transfer something that isn't text."

What do you mean, "isn't text"? What else is it? And how do I then transfer the text and make Perl understand that it is UTF-8 encoded?

Strange thing: I have had a full site running in UTF-8 for years, and every CGI script has this "use open ':std' => ':encoding(UTF-8)';" at the beginning (pretty much the same init block as in the example above), because without it I just did not get anything to work... Now I copied it to another project, stripped it down to a skeleton, and it does not work anymore... It is too mysterious to me. Nõnda, WK
Re^3: Confusing UTF-8 bug in CGI-script by ikegami (Patriarch) on Feb 01, 2011 at 20:50 UTC
"What do you mean, "isn't text"?"

The text you typed into your browser is transformed by it as follows:

1. It is encoded using the proper character encoding.
2. Some of it is encoded using percent encoding.
3. The resulting string is joined to others to form an `application/x-www-form-urlencoded` document.

That leaves you something that's no longer your text. The proper inverse of that is:

1. Split the form data into its components.
2. Remove any percent encoding.
3. Remove the character encoding.

You're adding an additional step:

1. Remove the character encoding. (XXX)
2. Split the form data into its components.
3. Remove any percent encoding.
4. Remove the character encoding.

The fourth step notices something is odd and throws an error.

"And how do I then transfer the text and make Perl understand that it is UTF-8 encoded?"

That's what the «`-utf8`» in «`use CGI qw(:all -utf8);`» does. "This makes CGI.pm treat all parameters as UTF-8 strings" by passing them to `decode`.
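The three inverse steps (split, remove percent encoding, remove character encoding) can be sketched by hand as follows. This is a deliberately simplified illustration, assuming a UTF-8 form body; `parse_form_body` is a hypothetical helper, not CGI.pm's parser, which handles many more cases (repeated parameters, uploads, other content types).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turns a raw form body (bytes) into decoded parameters.
sub parse_form_body {
    my ($body) = @_;
    my %param;
    for my $pair (split /[&;]/, $body) {            # 1. split into components
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                                # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # 2. remove percent encoding
            # 3. remove the character encoding (bytes become characters)
            $_ = decode('UTF-8', $_, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        }
        $param{$name} = $value;
    }
    return \%param;
}

# "õž€" as a browser would typically send it: UTF-8 bytes, percent-encoded.
my $body = 'text=%C3%B5%C5%BE%E2%82%AC&lang=et';
my $form = parse_form_body($body);

printf "text has %d characters\n", length $form->{text};   # text has 3 characters
```

Note that the character decoding comes last, per component. The body as a whole is handled as bytes until then, which is why STDIN should stay in binary mode.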
Re^4: Confusing UTF-8 bug in CGI-script by wanradt (Scribe) on Feb 01, 2011 at 21:13 UTC
Re^3: Confusing UTF-8 bug in CGI-script by wanradt (Scribe) on Feb 01, 2011 at 20:27 UTC
Huh, at least I found the difference between the working production code and the script here: in production I initialize the CGI object (inside a BEGIN block) before asking `use open` to decode STDIN. It seems that this is the significant difference. Still, I hope you can explain why input from STDIN is not text. Nõnda, WK
Re^4: Confusing UTF-8 bug in CGI-script by ikegami (Patriarch) on Feb 01, 2011 at 20:55 UTC
Re: Confusing UTF-8 bug in CGI-script by dirko.van.schalkwyk (Initiate) on Feb 01, 2011 at 17:41 UTC
Hi,

The issue is that perl does not know what encoding you are using in your source. The following article has several solutions: http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Hope this helps.

Regards,
Dirko
Re^2: Confusing UTF-8 bug in CGI-script by Corion (Patriarch) on Feb 01, 2011 at 17:53 UTC
Actually, `use utf8` tells Perl that the source code is (intended to be) UTF-8.
