in reply to CGI hidden params vs. character encoding

First of all, decode( 'utf8', $untrusted ) is a security issue.

Secondly, UTF8 is a perl-specific encoding. UTF-8 is the actual encoding. It doesn't make sense to tell the browser you're using UTF8 (-encoding => 'utf8').

I haven't pinpointed the problem, but changing UTF8 to UTF-8 throughout fixed the problem.

Re^2: CGI hidden params vs. character encoding
by graff (Chancellor) on May 27, 2008 at 22:41 UTC
    First of all, decode( 'utf8', $untrusted ) is a security issue.

    Wouldn't that depend on what you do with the value that you get back from decode()? Also, what would be the remedy? I would expect it's okay to do something like eval { decode( 'UTF-8', $untrusted, Encode::FB_CROAK ) } and check $@, or maybe just pass the return value from decode() through a regex or other test for valid content.
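    A minimal sketch of the eval/FB_CROAK idea above, assuming the untrusted value arrives as a plain byte string. The helper name decode_strict is hypothetical and not part of Encode; only decode() and Encode::FB_CROAK come from the module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string, or undef
# when the input bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # Work on a copy: when a CHECK argument is given, decode() may
    # modify the source buffer it is handed.
    my $copy = $bytes;
    my $text = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return $text;    # undef if decode() croaked inside the eval
}

my $good = decode_strict("\xd1\x84");    # well-formed encoding of U+0444
my $bad  = decode_strict("abc\x80");     # lone continuation byte

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

    Returning undef lets the caller decide whether to reject the request or fall back to a default, rather than silently passing along substituted characters.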

    Secondly, UTF8 is a perl-specific encoding. UTF-8 is the actual encoding.

    I haven't pinpointed the problem, but changing UTF8 to UTF-8 throughout fixed the problem.

    Okay... I had to try twice -- I didn't get all the "utf8" strings changed over to "UTF-8" on the first try, but after I fixed the one I had forgotten ("binmode STDOUT..."), it worked. How strange...

    Thanks!!!

      it worked. How strange...

      I found it strange too. I just clued in to what the error is.

      First of all,

      binmode STDOUT, ':utf-8';

      is a no-op, since there's no "utf-8" layer.

      >perl -le"print binmode(STDERR, ':utf8')?1:0"
      1
      >perl -le"print binmode(STDERR, ':utf-8')?1:0"
      0
      >perl -le"print binmode(STDERR, ':encoding(utf8)')?1:0"
      1
      >perl -le"print binmode(STDERR, ':encoding(utf-8)')?1:0"
      1

      If we do it properly (:encoding(utf-8)) we end up with your original problem.

      Your problem is that you are double-encoding! You're telling CGI to encode your data using UTF8 (-charset => 'utf-8') and then you encode it again using binmode STDOUT, ":utf8";.

      The solution is to get rid of binmode completely and only use CGI's methods to output.
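      To make "double-encoding" concrete, here is a small sketch using Encode directly rather than CGI or a PerlIO layer. It assumes the data has already been encoded to UTF-8 bytes once before a second encoding step sees it; whether that is exactly what happens inside CGI is not confirmed in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $text  = "\x{444}";                  # one character, U+0444
my $once  = encode( 'UTF-8', $text );   # correct: two bytes
# Encoding again treats each byte as a Latin-1 character and
# encodes it separately, doubling the length.
my $twice = encode( 'UTF-8', $once );

print "once:  ", hex_bytes($once),  "\n";   # D1 84
print "twice: ", hex_bytes($twice), "\n";   # C3 91 C2 84
```

      A browser told to expect UTF-8 would decode the second form back to the two characters U+00D1 and U+0084 instead of the single intended letter.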

        Your problem is that you are double-encoding! You're telling CGI to encode your data using UTF8 (-charset => 'utf-8') and then you encode it again using binmode STDOUT, ":utf8";.

        But... But... Then why did the double-encoding show up only in that one place?? If the behavior were consistent throughout, I would understand, but I still can't figure out how I got the particular behavior that I did.

        The solution is to get rid of binmode completely and only use CGI's methods to output.

        I'm not sure about that. If I comment out the "binmode STDOUT..." in the OP code (having fixed all other encoding specs to "UTF-8" as described), I get "Wide character in print" warnings showing up in the error log. Also, I don't think I should have to rely entirely on CGI methods for printing content.

Re^2: CGI hidden params vs. character encoding
by graff (Chancellor) on May 28, 2008 at 02:19 UTC
    FINALLY FIGURED IT OUT! (...in a manner of speaking)

    #!/usr/bin/perl -T -w
    use strict;
    use CGI qw(:standard);
    use Encode;

    binmode STDOUT, ":utf8";

    my $cgi = CGI->new();
    print $cgi->header(-charset => 'UTF-8'),
          $cgi->start_html(-title    => "Testing hidden-input character encoding",
                           -encoding => 'UTF-8' ),
          $cgi->start_form;

    my $parms = $cgi->Vars;
    if ( $$parms{submit} ) {
        my $newtest = decode( 'utf8', $$parms{testtext} );
        my $newhid  = decode( 'utf8', $$parms{testtext_hid} );
        delete $$parms{testtext_hid};  ### THIS IS WHAT FIXES THE PROBLEM
        print "<p/> The testtext parameter as received was: ". $newtest,
              "<p/> The hidden parameter was: ". $newhid,
              $cgi->hidden( "testtext_hid", $newhid );
    }
    else {
        my $testtext = "\x{444}\x{443}\x{431}\x{430}\x{440}";
        print $cgi->textfield( -name => "testtext", -value => $testtext ),
              $cgi->hidden( "testtext_hid", $testtext ),
              $cgi->submit(-name => "submit", -value => "submit" );
    }
    print $cgi->end_form, $cgi->end_html;

    I can only guess what might be going on under the covers when CGI sees that it is being given a new value (with the utf8 flag on) that replaces one of the existing parameters already in the "context" of the form (which does not have its utf8 flag on, even though it may already contain valid utf8 data -- it comes from an untrusted source, after all).

    In any case, if I remove the existing parameter from the current "context", the assignment proceeds as expected -- no double encoding.

    All in all, it smells like a bug in CGI, but I'm far enough behind in my coding at this point that I'm happy just to know there is a way to get the intended behavior. Case closed, as far as I'm concerned.

      FINALLY FIGURED IT OUT! (...in a manner of speaking)

      oh, right! Here are two better ways:

      print $cgi->p("The testtext parameter as received was: ". escapeHTML($newtest)),
            $cgi->p("The hidden parameter was: ". escapeHTML($newhid)),
            $cgi->hidden(-name=>"testtext_hid", -default=>$newhid, -override=>1);

      $cgi->param('testtext_hid', $newhid);
      print $cgi->p("The testtext parameter as received was: ". escapeHTML($newtest)),
            $cgi->p("The hidden parameter was: ". escapeHTML($newhid)),
            $cgi->hidden("testtext_hid");

      Note the -override=>1 in the first snippet, or how the second snippet sets the parameter instead of setting the default.

      Also note how I did the P elements. <p/> makes no sense. <p/>text<p/>text means <p></p>text<p></p>text but you want <p>text</p><p>text</p>.

      Finally, note how I used escapeHTML to avoid an injection attack and invalid HTML generation.

      All in all, it smells like a bug in CGI

      I agree. Not about having to use -override, which is clearly documented under the "CREATING FILL-OUT FORMS" header, but about how it handles (or rather doesn't handle) encodings other than iso-latin-1.

      Try without Vars(), just using param()
Re^2: CGI hidden params vs. character encoding
by graff (Chancellor) on May 27, 2008 at 23:21 UTC
    Taking another look at the "utf8 security" issue, here's what I'm taking as the "primary reference" (at least, the one here at PM): UTF8 related proof of concept exploit released at T-DOSE.

    The key point, I think, is this:

    Once the UTF8 flag is set, Perl does not check the validity of the UTF8 sequences further. Typically, this is okay, because it was Perl that set the flag in the first place. However, some people set the UTF8 flag manually. They circumvent protection built into encoding/decoding functions and PerlIO layers, either because it's easier (less typing), for performance reasons, or even because they don't know they're doing something wrong.

    This problem is unrelated to the use of "decode()" shown in the OP script here. The "decode()" function takes a string (ignoring its utf8 flag) and tries to interpret it as a utf8 byte string. With its default behavior (as shown in the OP), any input bytes that are not interpretable as utf8 data are replaced by the substitution character (U+FFFD), and the result is always a valid utf8 string (with the utf8 flag set by perl).

    My reading of the exploit is that you only get into trouble when you deliberately twiddle the utf8 flag of a scalar yourself, without checking to see whether it really is fully interpretable as valid utf8 characters. So I would conclude that the OP script is not a case that poses a security problem involving the use of utf8 data.
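    The distinction can be sketched as follows, assuming a byte string that contains a lone continuation byte. This only illustrates decoding versus flag-twiddling; it is not the exploit from the referenced post.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "abc\x80";    # \x80 alone is not a valid UTF-8 sequence

# decode() validates the input; by default it substitutes U+FFFD
# for malformed bytes, so the result is always well-formed.
my $decoded = decode( 'utf8', $bytes );

# _utf8_on() only flips the internal flag; the bytes are not checked.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

printf "decoded is consistent: %s\n", utf8::valid($decoded) ? 'yes' : 'no';
printf "flagged is consistent: %s\n", utf8::valid($flagged) ? 'yes' : 'no';
```

    Here utf8::valid() is used only to report whether the scalar's internal state is consistent; the flagged copy is deliberately never printed or otherwise processed.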

      This problem is unrelated to the use of "decode()" shown in the OP script here

      You're right. I thought binmode($untrusted_fh, ':utf8') was the same as decode('utf8', $untrusted), but it's the same as _utf8_on($untrusted).