Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation:

I have an HTML form. When its contents are submitted to a Perl script, any non-ASCII characters come out wrongly encoded. For instance, if I type 'I like Beyoncé' it comes out in my script as 'I like BeyoncÃ©'.

The script does some other stuff, AJAX was involved at one point, but I have now reduced it to a simple (POST or GET) request to this script:

use CGI::Simple;
my $q = CGI::Simple->new();
print "Content-type: text/plain\n\n";
print $q->param('text');

And if the 'text' parameter contains anything double-byte/UTF-8, I get that output.

Things I have tried:

  • Adding accept-charset="UTF-8" to the form element
  • Declaring the charset in the HTML page with <meta charset="utf-8">
  • Configuring the server to send a UTF-8 Content-Type header
  • Adding binmode STDOUT, ":utf8"; to the script

But still the problem persists. What else can I do?

Confusingly, this page: https://www.i18nqa.com/debug/utf8-debug.html seems to suggest that my errant character, é turning into Ã©, is caused by Windows-1252 encoding, a.k.a. ISO 8859-1, but this has nothing to do with Windows. It's happening on OS X for a start.

TIA, fellow monks

SOLVED: This is what worked:

use Encode;
print Encode::decode_utf8($q->param('text'));

I guess I just have to accept that when the text leaves URL A it might be UTF-8 but there's no way to tell page B that, and I will always have to decode it.
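
Putting it together, a minimal version of the script would look roughly like this (the charset in the header and the binmode line come from the replies below):

use strict;
use warnings;
use CGI::Simple;
use Encode;

my $q = CGI::Simple->new();

# Encode everything printed to STDOUT as UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Tell the browser what encoding the response body uses
print "Content-type: text/plain; charset=utf-8\n\n";

# The submitted parameter arrives as raw UTF-8 octets; decode them into Perl characters
print Encode::decode_utf8( scalar $q->param('text') );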


Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8?
by haj (Vicar) on Mar 10, 2020 at 22:09 UTC

    Most of the steps you took were fine, so just a few remarks:

    • If you configure the server to send a UTF-8 content-type header, this usually only works for static files like the page your form resides in. For CGI, you are supposed to provide your own content-type (as you did), and the server must not touch it. As of today, static pages can declare their own encoding in a meta element, e.g. <meta charset="utf-8">
    • Adding accept-charset="UTF-8" is good for clarity (the default value is "UNKNOWN"), but it isn't strictly required, since browsers are supposed to use the encoding of the containing document if there's no accept-charset attribute.
    There are some other steps I don't see in your script:
    • Your CGI script does not get any information about the encoding of characters from the request. If the browser sends UTF-8, then you need to decode the parameter accordingly.
    • If you write UTF-8 in the response of your CGI script, you ought to print "Content-type: text/plain; charset=utf-8\n\n": In HTTP, the default charset is ISO-8859-1. Browsers can't infer the encoding if there's nothing in the Content-type header, so they'll display your two-byte é as two bytes, namely Ã©.
    • If you get a "wide character" warning (if you don't, then you forgot to decode), you might also have omitted to declare the encoding of your output stream. Perl also uses a single-byte encoding by default. To print Unicode characters as UTF-8, you need to do something like binmode STDOUT, ":encoding(UTF-8)"; (a combined sketch follows this list).
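
    Putting those three points together, a minimal sketch of a script that decodes the parameter by hand, declares the charset in the header, and sets the output layer could look like this (untested against your setup):

    use strict;
    use warnings;
    use CGI::Simple;
    use Encode qw(decode);

    my $q = CGI::Simple->new();

    # Encode everything printed to STDOUT as UTF-8
    binmode STDOUT, ':encoding(UTF-8)';

    # Declare the charset so the browser knows how to interpret the bytes
    print "Content-type: text/plain; charset=utf-8\n\n";

    # The parameter arrives as raw octets; decode them into Perl characters
    my $text = decode('UTF-8', scalar $q->param('text'));
    print $text, "\n";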

    About diagnostics: Windows codepage 1252 and ISO-8859-1 are different, but quite similar, so there's no way to distinguish between the two in this case.
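
    For what it's worth, the two UTF-8 bytes of é map to the same characters under both of those encodings, which is why the garbled output looks identical either way; a quick check with Encode illustrates it:

    use strict;
    use warnings;
    use Encode qw(decode);

    binmode STDOUT, ':encoding(UTF-8)';

    # 0xC3 0xA9 is the UTF-8 encoding of 'é'
    my $bytes = "\xC3\xA9";

    # Both single-byte encodings map these octets to the same two characters
    print decode('ISO-8859-1', $bytes), "\n";   # prints Ã©
    print decode('cp1252',     $bytes), "\n";   # prints Ã© as well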

      As of today, static pages can declare their own encoding in a meta element, e.g. <meta charset="utf-8">

      This is what I meant when I said I declared the charset in the HTML

      If you write UTF-8 in the response of your CGI script, you ought to print "Content-type: text/plain; charset=utf-8\n\n":

      The server had already added the UTF-8 content type in the HTTP headers

      binmode STDOUT, ":encoding(UTF-8)";

      I had done this but forgot to include it in the code snippet. It still didn't work.

        The server had already added the UTF-8 content type in the HTTP headers

        Could you please elaborate which server you are using, and how you configure it to modify the content type of a CGI script? You could also verify in your browser which encoding it uses for your text/plain response.

        I also had a look at the source of CGI::Simple and found out:

        • The module will decode parameters for you only if you set the global variable $CGI::Simple::PARAM_UTF8 to a true value. That's sort of difficult to guess, since it isn't documented. Of course, you can decode yourself, but it looks like you didn't.
        • The module will add ; charset=utf-8 to the content type header only if you print it as print $q->header(-type => 'text/plain');, but not if you just print "Content-type: text/plain\n\n";.
        So, the following just works for me:
        use strict;
        use warnings;
        use CGI::Simple;

        $CGI::Simple::PARAM_UTF8 = 1;
        my $q = CGI::Simple->new();
        $q->charset('utf-8');
        binmode STDOUT, ':encoding(UTF-8)';

        print $q->header(-type => 'text/plain');
        print $q->param('text'), "\n";
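
        If you want to see what actually goes over the wire, you could also call the script directly and inspect the response header, for instance with HTTP::Tiny (the URL below is just a placeholder for wherever your script lives):

        use strict;
        use warnings;
        use HTTP::Tiny;
        use Encode qw(encode);

        # Placeholder URL; point it at your actual CGI script
        my $url = 'http://localhost/cgi-bin/test.cgi';

        # Send the parameter as UTF-8 octets, the same way a browser would
        my $res = HTTP::Tiny->new->post_form(
            $url,
            { text => encode('UTF-8', "I like Beyonc\x{e9}") },
        );

        # With the script above this should read: text/plain; charset=utf-8
        print "Content-Type: $res->{headers}{'content-type'}\n";

        # The body arrives as raw UTF-8 octets
        print "Body: $res->{content}\n";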
Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8?
by graff (Chancellor) on Mar 11, 2020 at 01:00 UTC
    I think you might notice an improvement if you added this line above the print statement:
    binmode STDOUT, ":utf8";

      I had done this but forgot to include it in the code snippet. It still didn't work, unfortunately.

Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8? (reproducible)
by Anonymous Monk on Mar 10, 2020 at 22:13 UTC

    Sorry, but ya ain't got nothing detailed enough to be reproducible :|

    and PEBKAC ;)

    use CGI::Simple;
    my $q = CGI::Simple->new();
    print "Content-type: text/plain\n\n";
    print $q->param('text');

    Try this

    use CGI -utf8;
    my $q = CGI->new;
    print $q->header(qw/ -charset UTF-8 /),
          $q->start_html,
          $q->param('unicodes');
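
    As far as I know, the -utf8 import flag makes CGI.pm decode incoming parameters as UTF-8 (much like $CGI::Simple::PARAM_UTF8 does for CGI::Simple), and -charset UTF-8 adds the charset to the Content-Type header.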