in reply to Re: CGI hidden params vs. character encoding
in thread CGI hidden params vs. character encoding

First of all, decode( 'utf8', $untrusted ) is a security issue.

Wouldn't that depend on what you do with the value that you get back from decode()? Also, what would be the remedy? I would expect it's okay to do something like eval { decode( 'UTF-8', $untrusted, Encode::FB_CROAK ) } and check $@, or maybe just pass the return value from decode() through a regex or other test for valid content.
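
A minimal sketch of the eval / FB_CROAK idea, in case it helps. The helper name decode_strict and the sample byte strings are made up for illustration; FB_CROAK and LEAVE_SRC are the documented Encode check flags.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not from the OP code): returns the decoded
# character string, or undef if the bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    my $chars = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops decode() from modifying $bytes in place.
        decode( 'UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $chars;    # undef when the eval died
}

my $good = decode_strict("caf\xC3\xA9 au lait");   # well-formed UTF-8
my $bad  = decode_strict("caf\xE9 au lait");       # Latin-1 byte, not valid UTF-8

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

Whether rejecting the value outright is the right policy presumably depends on what the script does with it afterwards.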

Secondly, UTF8 is a perl-specific encoding. UTF-8 is the actual encoding.

I haven't pinpointed the problem, but changing UTF8 to UTF-8 throughout fixed the problem.

Okay... I had to try twice -- I didn't get all the "utf8" strings changed over to "UTF-8" on the first try, but after I fixed the one I had forgotten ("binmode STDOUT..."), it worked. How strange...

Thanks!!!

Re^3: CGI hidden params vs. character encoding
by ikegami (Patriarch) on May 27, 2008 at 23:31 UTC

    it worked. How strange...

    I found it strange too. I just clued in to what the error is.

    First of all,

    binmode STDOUT, ':utf-8';

    is a no-op, since there's no "utf-8" layer.

    >perl -le"print binmode(STDERR, ':utf8')?1:0"
    1
    >perl -le"print binmode(STDERR, ':utf-8')?1:0"
    0
    >perl -le"print binmode(STDERR, ':encoding(utf8)')?1:0"
    1
    >perl -le"print binmode(STDERR, ':encoding(utf-8)')?1:0"
    1

    If we do it properly (:encoding(utf-8)), we end up with your original problem.

    Your problem is that you are double-encoding! You're telling CGI to encode your data using UTF8 (-charset => 'utf-8') and then you encode it again using binmode STDOUT, ":utf8";.

    The solution is to get rid of binmode completely and only use CGI's methods to output.

      Your problem is that you are double-encoding! You're telling CGI to encode your data using UTF8 (-charset => 'utf-8') and then you encode it again using binmode STDOUT, ":utf8";.

      But... But... Then why did the double-encoding show up only in that one place?? If the behavior were consistent throughout, I would understand, but I still can't figure out how I got the particular behavior that I did.

      The solution is to get rid of binmode completely and only use CGI's methods to output.

      I'm not sure about that. If I comment out the "binmode STDOUT..." in the OP code (having fixed all other encoding specs to "UTF-8" as described), I get "Wide character in print" warnings showing up in the error log. Also, I don't think I should have to rely entirely on CGI methods for printing content.
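
      For what it's worth, the warning can be reproduced outside of CGI. This is only a sketch: it prints to in-memory filehandles instead of STDOUT, and the variable names are invented for the demonstration.

      ```perl
      use strict;
      use warnings;

      # Collect warnings instead of letting them go to STDERR.
      my @warnings;
      local $SIG{__WARN__} = sub { push @warnings, $_[0] };

      my $text = "\x{263A}";    # one character above 0xFF

      # No encoding layer: perl has to guess how to write the character.
      open my $raw, '>', \my $raw_buf or die $!;
      print {$raw} $text;       # triggers "Wide character in print"
      close $raw;

      # With an encoding layer the character is encoded explicitly.
      open my $enc, '>:encoding(UTF-8)', \my $enc_buf or die $!;
      print {$enc} $text;       # no warning
      close $enc;

      print "warnings: ", scalar(@warnings), "\n";
      print "encoded bytes: ", length($enc_buf), "\n";
      ```

      So the warning only says that a character string reached a handle with no encoding layer; something still has to do the encoding exactly once.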

        But... But... Then why did the double-encoding show up only in that one place??

        Because the rest were ASCII characters.

        use Encode qw( encode );
        $str = '<p>foo</p>';
        for (1..5) {
            print("$str\n");
            $str = encode('UTF-8', $str);
        }

        <p>foo</p>
        <p>foo</p>
        <p>foo</p>
        <p>foo</p>
        <p>foo</p>
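
        For contrast, a sketch of the same loop with a single non-ASCII character, tracking the string length rather than printing the (increasingly garbled) text. The choice of U+00E9 is just an example.

        ```perl
        use strict;
        use warnings;
        use Encode qw( encode );

        my $str = "\x{E9}";    # U+00E9 (e with acute accent), one character
        my @lengths;
        for ( 1 .. 3 ) {
            push @lengths, length($str);
            # Each pass treats the previous pass's bytes as characters
            # and encodes them again.
            $str = encode( 'UTF-8', $str );
        }
        print "@lengths\n";    # prints "1 2 4"
        ```

        ASCII characters encode to themselves, so repeated encoding is invisible for them, while anything above 0x7F grows on every pass.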

        I'm not sure about that. If I comment out the "binmode STDOUT..." in the OP code (having fixed all other encoding specs to "UTF-8" as described), I get "Wide character in print" warnings showing up in the error log

        ARGH! CGI doesn't seem to be encoding. What's -charset for, then!? I need to look into this more.

        By the way, <p/> makes no sense. <p/>text<p/>text means <p></p>text<p></p>text, but what you want is <p>text</p><p>text</p>.