Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

Here's the situation:

I have an HTML form. When its contents are submitted to a Perl script, any non-ASCII characters come out wrongly encoded. For instance, if I type 'I like Beyoncé' it comes out in my script as 'I like BeyoncÃ©'.

The script does some other stuff, AJAX was involved at one point, but I have now reduced it to a simple (POST or GET) request to this script:

use CGI::Simple;
my $q = CGI::Simple->new();
print "Content-type: text/plain\n\n";
print $q->param('text');

And if the 'text' parameter contains anything double-byte/UTF-8, I get that output.

Things I have tried:

  • Adding accept-charset="UTF-8" to the form element
  • Declaring the charset in the HTML page with <meta charset="utf-8">
  • Configuring the server to send a UTF-8 Content-Type header
  • Adding binmode STDOUT, ":utf8"; to the script

But still the problem persists. What else can I do?

Confusingly, this page: https://www.i18nqa.com/debug/utf8-debug.html seems to suggest that my errant character, é turning into Ã©, is caused by Windows-1252 encoding, a.k.a. ISO 8859-1, but this has nothing to do with Windows. It's happening on OS X for a start.

TIA, fellow monks

SOLVED: This is what worked:

use Encode;
print Encode::decode_utf8($q->param('text'));

I guess I just have to accept that when the text leaves URL A it might be UTF-8 but there's no way to tell page B that, and I will always have to decode it.
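
Putting it together, a minimal version of the script would look roughly like this (the charset in the header and the binmode line come from the replies below):

use strict;
use warnings;
use CGI::Simple;
use Encode;

my $q = CGI::Simple->new();

# Encode everything printed to STDOUT as UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Tell the browser what encoding the response body uses
print "Content-type: text/plain; charset=utf-8\n\n";

# The submitted parameter arrives as raw UTF-8 octets; decode them into Perl characters
print Encode::decode_utf8( scalar $q->param('text') );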


Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8?
by haj (Vicar) on Mar 10, 2020 at 22:09 UTC

    Most of the steps you took were fine, so just a few remarks:

    • If you configure the server to send a UTF-8 content-type header, this usually only works for static files like the page your form resides in. For CGI, you are supposed to provide your own content-type (as you did), and the server must not touch it. As of today, static pages can declare their own encoding in a meta element, e.g. <meta charset="utf-8">
    • Adding accept-charset="UTF-8" is good for clarity (the default value is "UNKNOWN"), but it isn't strictly required, since browsers are supposed to use the encoding of the containing document if there's no accept-charset attribute.
    There are some other steps I don't see in your script:
    • Your CGI script does not get any information about the encoding of characters from the request. If the browser sends UTF-8, then you need to decode the parameter accordingly.
    • If you write UTF-8 in the response of your CGI script, you ought to print "Content-type: text/plain; charset=utf-8\n\n": In HTTP, the default charset is ISO-8859-1. Browsers can't infer the encoding if there's nothing in the Content-type header, so they'll display your two-byte é as two bytes, namely Ã©.
    • If you get a "wide character" warning (if you don't, then you forgot to decode), you might also have omitted to declare the encoding of your output stream. Perl also uses a single-byte encoding by default. To print Unicode characters as UTF-8, you need to do something like binmode STDOUT, ":encoding(UTF-8)"; (a combined sketch follows this list).
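
    Putting those three points together, a minimal sketch of a script that decodes the parameter by hand, declares the charset in the header, and sets the output layer could look like this (untested against your setup):

    use strict;
    use warnings;
    use CGI::Simple;
    use Encode qw(decode);

    my $q = CGI::Simple->new();

    # Encode everything printed to STDOUT as UTF-8
    binmode STDOUT, ':encoding(UTF-8)';

    # Declare the charset so the browser knows how to interpret the bytes
    print "Content-type: text/plain; charset=utf-8\n\n";

    # The parameter arrives as raw octets; decode them into Perl characters
    my $text = decode('UTF-8', scalar $q->param('text'));
    print $text, "\n";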

    About diagnostics: Windows codepage 1252 and ISO-8859-1 are different, but quite similar, so there's no way to distinguish between the two in this case.
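
    For what it's worth, the two UTF-8 bytes of é map to the same characters under both of those encodings, which is why the garbled output looks identical either way; a quick check with Encode illustrates it:

    use strict;
    use warnings;
    use Encode qw(decode);

    binmode STDOUT, ':encoding(UTF-8)';

    # 0xC3 0xA9 is the UTF-8 encoding of 'é'
    my $bytes = "\xC3\xA9";

    # Both single-byte encodings map these octets to the same two characters
    print decode('ISO-8859-1', $bytes), "\n";   # prints Ã©
    print decode('cp1252',     $bytes), "\n";   # prints Ã© as well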

      As of today, static pages can declare their own encoding in a meta element, e.g. <meta charset="utf-8">

      This is what I meant when I said I declared the charset in the HTML

      If you write UTF-8 in the response of your CGI script, you ought to print "Content-type: text/plain; charset=utf-8\n\n":

      The server had already added the UTF-8 content type in the HTTP headers

      binmode STDOUT, ":encoding(UTF-8)";

      I had done this but forgot to include it in the code snippet. It still didn't work.

        The server had already added the UTF-8 content type in the HTTP headers

        Could you please elaborate which server you are using, and how you configure it to modify the content type of a CGI script? You could also verify in your browser which encoding it uses for your text/plain response.

        I also had a look at the source of CGI::Simple and found out:

        • The module will decode parameters for you only if you set the global variable $CGI::Simple::PARAM_UTF8 to a true value. That's sort of difficult to guess, since it isn't documented. Of course, you can decode yourself, but it looks like you didn't.
        • The module will add ; charset=utf-8 to the content type header only if you print it as print $q->header(-type => 'text/plain');, but not if you just print "Content-type: text/plain\n\n";.
        So, the following just works for me:
        use strict;
        use warnings;
        use CGI::Simple;

        $CGI::Simple::PARAM_UTF8 = 1;
        my $q = CGI::Simple->new();
        $q->charset('utf-8');
        binmode STDOUT, ':encoding(UTF-8)';

        print $q->header(-type => 'text/plain');
        print $q->param('text'), "\n";
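
        If you want to see what actually goes over the wire, you could also call the script directly and inspect the response header, for instance with HTTP::Tiny (the URL below is just a placeholder for wherever your script lives):

        use strict;
        use warnings;
        use HTTP::Tiny;
        use Encode qw(encode);

        # Placeholder URL; point it at your actual CGI script
        my $url = 'http://localhost/cgi-bin/test.cgi';

        # Send the parameter as UTF-8 octets, the same way a browser would
        my $res = HTTP::Tiny->new->post_form(
            $url,
            { text => encode('UTF-8', "I like Beyonc\x{e9}") },
        );

        # With the script above this should read: text/plain; charset=utf-8
        print "Content-Type: $res->{headers}{'content-type'}\n";

        # The body arrives as raw UTF-8 octets
        print "Body: $res->{content}\n";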
Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8?
by graff (Chancellor) on Mar 11, 2020 at 01:00 UTC
    I think you might notice an improvement if you added this line above the print statement:
    binmode STDOUT, ":utf8";

      I had done this but forgot to include it in the code snippet. It still didn't work, unfortunately.

Re: How do I convince my Perl script that UTF-8 from an HTML form really is UTF-8? (reproducible)
by Anonymous Monk on Mar 10, 2020 at 22:13 UTC

    Sorry, but ya ain't got nothing detailed enough to be reproducible :|

    and PEBKAC ;)

    use CGI::Simple;
    my $q = CGI::Simple->new();
    print "Content-type: text/plain\n\n";
    print $q->param('text');

    Try this

    use CGI -utf8;
    my $q = CGI->new;
    print $q->header(qw/ -charset UTF-8 /),
          $q->start_html,
          $q->param('unicodes');
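
    As far as I know, the -utf8 import flag makes CGI.pm decode incoming parameters as UTF-8 (much like $CGI::Simple::PARAM_UTF8 does for CGI::Simple), and -charset UTF-8 adds the charset to the Content-Type header.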