in reply to Confusing UTF-8 bug in CGI-script

Not sure if it helps, but your script works fine for me.  I.e., when I drop it into the cgi-bin directory of an Apache server, view the page in Firefox, and enter some non-ASCII content into the textarea input field (e.g. by cut-n-pasting something from a Chinese web page), the data is echoed back from the script without an encoding problem.  And if I trace the traffic between browser and web server, it's encoded as UTF-8, as expected.

The script also works when I comment out the lines use locale; and use open ':std' => ':encoding(UTF-8)'; (which are superfluous at best, IMHO). And the use utf8; line is of course only required if the script source itself is in fact encoded in UTF-8 (the literal characters "öäüõžš¢ð€¶", in this case).

(tested with various versions of CGI.pm from 3.04 to 3.49 — Update: 3.04, 3.15, 3.29, 3.48 and 3.49, to be precise)

Correction: with CGI-3.49/Perl-5.12.2, STDOUT needs to be explicitly declared as UTF-8 (either with binmode STDOUT, ":utf8", or with use open...), otherwise I'm getting warnings "Wide character in print" in the error log.  This is not the case with earlier versions.
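For anyone who wants to see the warning in isolation, here is a minimal sketch outside of CGI. It assumes in-memory filehandles behave like STDOUT for this purpose; the variable names are made up for illustration and none of this is taken from the OP's script.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR (or the error log).
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my $text = "\x{20AC}";    # EURO SIGN, a code point above 0xFF

# Handle without an encoding layer: printing a wide character warns.
my $raw = '';
open(my $plain, '>', \$raw) or die "open: $!";
print {$plain} $text;
close $plain;

# Handle with an encoding layer: the string is encoded on output.
my $encoded = '';
open(my $utf8, '>:encoding(UTF-8)', \$encoded) or die "open: $!";
print {$utf8} $text;
close $utf8;

printf "warnings caught: %d\n", scalar @warnings;
printf "bytes written through the layer: %d\n", length $encoded;
```

The only warning should come from the first print; the handle opened with the layer stays quiet, which is the same effect binmode STDOUT, ":utf8" or use open has in the script.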

Re^2: Confusing UTF-8 bug in CGI-script
by ikegami (Patriarch) on Feb 01, 2011 at 18:40 UTC

    The script also works when I comment out the lines use locale; and use open ':std' => ':encoding(UTF-8)'; (which are superfluous at best, IMHO).

    «use locale;» is indeed superfluous since he doesn't do any operations that use locales (cmp, lc, etc). It's not relevant to the OP's question since it doesn't affect encoding.

    «use open ':std' => ':encoding(UTF-8)';» is not superfluous. Part of what it does is necessary, and the other part of what it does is wrong. Specifically,

    BEGIN {
        # Wrong, and the cause of the OP's problem. See my reply to the OP.
        binmode(STDIN, ':encoding(UTF-8)');

        # Necessary to encode the returned HTML.
        binmode(STDOUT, ':encoding(UTF-8)');

        # Necessary to encode error messages for the log.
        binmode(STDERR, ':encoding(UTF-8)');
    }

    It could be replaced with the following or something equivalent, but it shouldn't be eliminated.

    BEGIN {
        binmode(STDIN);                       # Form data
        binmode(STDOUT, ':encoding(UTF-8)');  # HTML
        binmode(STDERR, ':encoding(UTF-8)');  # Error messages
    }
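To illustrate why the form data is treated as bytes at the handle level, here is a rough sketch of decoding a single urlencoded value by hand. The sample value and variable names are hypothetical, and this is not how CGI.pm does it internally; it only shows that the decode step belongs after unescaping, not on STDIN itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical field value as it might appear in a urlencoded request body.
my $raw_value = 'caf%C3%A9';

# Percent-unescaping yields a string of bytes, not characters.
(my $bytes = $raw_value) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# Decoding the bytes as UTF-8 yields the character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# Encoding again on the way out gives back the original bytes,
# which is the job the layer on STDOUT does for the HTML.
my $round_trip = encode('UTF-8', $chars);
print "round trip matches\n" if $round_trip eq $bytes;
```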
      # Wrong, and the cause of the OP's problem. See my reply to the OP.
      binmode(STDIN, ':encoding(UTF-8)');

      That's what I would've thought, too, but interestingly, it doesn't do any harm in practice (I did try it), and

      # Necessary to encode the returned HTML.
      binmode(STDOUT, ':encoding(UTF-8)');

      only seems to be required with newer versions of CGI.pm (as I mentioned). Older versions apparently did the encoding themselves before printing to STDOUT (?)

        it doesn't do any harm in practice

        I don't know how you can say that after saying yourself that removing it also fixes the OP's problem.

        Update: Well, you said that removing use open fixes the issue, but I doubt you're claiming that binmoding output handles leads to a decoding error, so that leaves the binmoding of the input handle.