CGI hidden params vs. character encoding

graff has asked for the wisdom of the Perl Monks concerning the following question:

Please consider the following little test.cgi script -- it does nothing, except to demonstrate the problem I'm having in passing a "hidden" parameter value from one form/page to the next:

#!/usr/bin/perl -T -w

use strict;
use CGI qw(:standard);
use Encode;

binmode STDOUT, ":utf8";

my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8'),
      $cgi->start_html(-title => "Testing hidden-input character encoding",
                       -encoding => 'utf8' ),
      $cgi->start_form;

my $parms = $cgi->Vars;
if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
          "<p/> The hidden parameter was: ". $newhid,
          $cgi->hidden( "testtext_hid", $newhid );
}
else {
    my $testtext = "\x{444}\x{443}\x{431}\x{430}\x{440}";
    print $cgi->textfield( -name => "testtext", -value => $testtext ),
          $cgi->hidden( "testtext_hid", $testtext ),
          $cgi->submit(-name => "submit", -value => "submit" );
}
print $cgi->end_form, $cgi->end_html;

(updated to fix indenting)

If the script is installed as "/cgi-bin/test.cgi", the first time you put that url into your browser, here is what you get back (bear in mind that the PM "code" tags are forcing the actual utf8 characters, which happen to be Cyrillic letters, into numeric entities, and this has nothing to do with the problem):

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Testing hidden-input character encoding</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf8" />
</head>
<body>
<form method="post" action="/cgi-bin/test.cgi" enctype="multipart/form-data">
<input type="text" name="testtext" value="&#1092;&#1091;&#1073;&#1072;&#1088;" /><input type="hidden" name="testtext_hid" value="&#1092;&#1091;&#1073;&#1072;&#1088;"  /><input type="submit" name="submit" value="submit" /></form>
</body>
</html>

Now, just hit the submit button, and what comes back is all fine, except for this part:

...<input type="hidden" name="testtext_hid" value="б&#132;б&#131;аБаАб&#128;" />...

Note the difference in the value being assigned to the "hidden" parameter ("testtext_hid"). What has happened is that the string of 5 utf8-encoded Russian letters has been treated as a string of 10 single-byte characters (i.e. as non-utf8 Latin-1 data), and each byte has been re-encoded as a two-byte utf8 character.

To clarify, here is the actual byte sequence of the hidden parameter value from the script's initial output, followed by the 4-digit unicode code points that result from treating that byte sequence as utf8 characters:

d1 84  d1 83  d0 b1  d0 b0  d1 80
 0444   0443   0431   0430   0440

and here is the equivalent detail from the second output (the value being assigned via <input type="hidden"...):

c3 91  c2 84  c3 91  c2 83  c3 90  c2 b1  c3 90  c2 b0  c3 91  c2 80
 00d1   0084   00d1   0083   00d0   00b1   00d0   00b0   00d1   0080
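
The two byte dumps above can be reproduced outside of CGI. The following is only a sketch of the suspected failure mode (encoding to UTF-8 twice, with the intermediate bytes read as Latin-1); it does not show where in the script the second encoding happens. The helper name `hex_bytes` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The five Cyrillic letters used in the test script.
my $text = "\x{444}\x{443}\x{431}\x{430}\x{440}";

# Correct output: encode the characters to UTF-8 once (10 bytes).
my $once = encode('UTF-8', $text);

# Failure mode: the 10 bytes are mistaken for 10 Latin-1 characters
# and encoded to UTF-8 a second time (20 bytes).
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

print hex_bytes($once),  "\n";
print hex_bytes($twice), "\n";
```

The first line printed should match the byte sequence from the initial page, and the second line the sequence seen in the hidden field after submitting.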

This has me stumped. It seems to me that there is no difference in how the string value is assigned to the hidden parameter in the two blocks of the script: in both cases, a "known utf8 string" (i.e. with the utf8 flag turned on) is being assigned via the CGI module as the value of the hidden parameter.

So why does it get treated differently in the two cases -- why are there two different versions of the hidden param value -- and more importantly, how can I get CGI to behave as expected? What bonehead simple fact am I missing?

(BTW, I have tried using "is_utf8" on the incoming "$$parms{testtext_hid}" value, such that "decode()" would only be used if in fact the param value was not already flagged as a utf8 string; this was overkill, because the appearance of the "$newhid" string as part of the page content is correct when the script is run as posted. Also, the script as posted does not produce any errors or warnings.)

Update: to make things even more puzzling, I amended the troublesome code block as follows:

if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
    "<p/> The hidden parameter has been decoded as: ". $newhid, 
    $cgi->hidden( "testtext_hid", $$parms{testtext_hid} );
        # note: changed text content slightly, and used "raw" hidden value (not "decoded" value)
}

The minor change in the text showed up (so I know I was running the intended version of the script), but the value assigned to the hidden parameter did not change -- it was still screwed up the same way as before! Now I'm really lost. (I should add that this is Perl 5.8.8 built for i386-freebsd-64int, with CGI v3.17.)


Replies are listed 'Best First'.
Re: CGI hidden params vs. character encoding by ikegami (Patriarch) on May 27, 2008 at 22:20 UTC
~~First of all, `decode( 'utf8', $untrusted )` is a security issue.~~ Secondly, UTF8 is a perl-specific encoding. UTF-8 is the actual encoding. It doesn't make sense to tell the browser you're using UTF8 (`-encoding => 'utf8'`). ~~I haven't pinpointed the problem, but changing UTF8 to UTF-8 throughout fixed the problem.~~
Re^2: CGI hidden params vs. character encoding by graff (Chancellor) on May 27, 2008 at 22:41 UTC
> First of all, decode( 'utf8', $untrusted ) is a security issue.

Wouldn't that depend on what you do with the value that you get back from decode()? Also, what would be the remedy? I would expect it's okay to do something like `eval { decode( 'UTF-8', $untrusted, Encode::FB_CROAK ) }` and check $@, or maybe just pass the return value from decode() through a regex or other test for valid content.

> Secondly, UTF8 is a perl-specific encoding. UTF-8 is the actual encoding. I haven't pinpointed the problem, but changing UTF8 to UTF-8 throughout fixed the problem.

Okay... I had to try twice -- I didn't get all the "utf8" strings changed over to "UTF-8" on the first try, but after I fixed the one I had forgotten ("binmode STDOUT..."), it worked. How strange... Thanks!!!
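
For what it's worth, the `eval { decode(...) }` idea mentioned above might look something like this. It is a minimal sketch, not a vetted input filter: `decode_strict` is a made-up helper name, and the sample byte strings are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef when the
# input is not well-formed UTF-8.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting a replacement character; LEAVE_SRC asks decode()
    # not to modify $octets while it works.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;    # undef if decode() died
}

my $good = decode_strict("\xd1\x84\xd1\x83");    # two Cyrillic letters
my $bad  = decode_strict("\xd1\x84\xff");        # 0xFF is never valid in UTF-8

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

Returning undef keeps the check at the call site simple; a real script would probably also want to report `$@` somewhere.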
Re^3: CGI hidden params vs. character encoding by ikegami (Patriarch) on May 27, 2008 at 23:31 UTC
> it worked. How strange...

I found it strange too. I just clued in to what the error is.

First of all, `binmode STDOUT, ':utf-8';` is a no-op, since there's no "utf-8" layer.

    >perl -le"print binmode(STDERR, ':utf8')?1:0"
    1
    >perl -le"print binmode(STDERR, ':utf-8')?1:0"
    0
    >perl -le"print binmode(STDERR, ':encoding(utf8)')?1:0"
    1
    >perl -le"print binmode(STDERR, ':encoding(utf-8)')?1:0"
    1

If we do it properly (`:encoding(utf-8)`), we end up with your original problem.

Your problem is that you are double-encoding! You're telling CGI to encode your data using UTF-8 (`-charset => 'utf-8'`) and then you encode it again using `binmode STDOUT, ":utf8";`. The solution is to get rid of `binmode` completely and only use CGI's methods to output.
Re^4: CGI hidden params vs. character encoding by graff (Chancellor) on May 28, 2008 at 00:41 UTC
Re^5: CGI hidden params vs. character encoding by ikegami (Patriarch) on May 28, 2008 at 01:24 UTC
Re^2: CGI hidden params vs. character encoding by graff (Chancellor) on May 27, 2008 at 23:21 UTC
Taking another look at the "utf8 security" issue, here's what I'm taking as the "primary reference" (at least, the one here at PM): UTF8 related proof of concept exploit released at T-DOSE. The key point, I think, is this:

> Once the UTF8 flag is set, Perl does not check the validity of the UTF8 sequences further. Typically, this is okay, because it was Perl that set the flag in the first place. However, some people set the UTF8 flag manually. They circumvent protection built into encoding/decoding functions and PerlIO layers, either because it's easier (less typing), for performance reasons, or even because they don't know they're doing something wrong.

This problem is unrelated to the use of "decode()" shown in the OP script here. The "decode()" function is used to take a string (ignoring its utf8 flag) and try to interpret it as a utf8 byte string. Using "decode()" with its default behavior (as shown in the OP), any input bytes that are not interpretable as utf8 data will be replaced by the Unicode replacement character (U+FFFD), and the result will always be a valid utf8 string (with the utf8 flag set by perl).

My reading of the exploit is that you only get into trouble when you deliberately twiddle the utf8 flag of a scalar yourself, without checking to see whether it really is fully interpretable as valid utf8 characters. So I would conclude that the OP script is *not* a case that poses a security problem involving the use of utf8 data.
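
To illustrate the default behavior of "decode()" described above, here is a small sketch. The input bytes are invented for the example; per the Encode documentation, decoding without a CHECK argument substitutes U+FFFD for malformed input rather than dying.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One valid two-byte letter followed by a stray continuation byte.
my $octets = "\xd1\x84\x80";

# No CHECK argument, as in the OP script: malformed input is replaced
# rather than rejected.
my $decoded = decode('utf8', $octets);

# Show each resulting character as a code point.
printf "U+%04X\n", ord($_) for split //, $decoded;

# The flag is set by decode() itself, not by hand.
print utf8::is_utf8($decoded) ? "utf8 flag on\n" : "utf8 flag off\n";
```

The point is that the string comes out of "decode()" already checked, which is different from switching the flag on over unchecked bytes.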
Re^3: CGI hidden params vs. character encoding by ikegami (Patriarch) on May 27, 2008 at 23:46 UTC
> This problem is unrelated to the use of "decode()" shown in the OP script here

You're right. I thought `binmode($untrusted_fh, ':utf8')` was the same as `decode('utf8', $untrusted)`, but it's the same as `_utf8_on($untrusted)`.
Re^2: CGI hidden params vs. character encoding by graff (Chancellor) on May 28, 2008 at 02:19 UTC
FINALLY FIGURED IT OUT! (...in a manner of speaking)

    #!/usr/bin/perl -T -w

    use strict;
    use CGI qw(:standard);
    use Encode;

    binmode STDOUT, ":utf8";

    my $cgi = CGI->new();
    print $cgi->header(-charset => 'UTF-8'),
          $cgi->start_html(-title => "Testing hidden-input character encoding",
                           -encoding => 'UTF-8' ),
          $cgi->start_form;

    my $parms = $cgi->Vars;
    if ( $$parms{submit} ) {
        my $newtest = decode( 'utf8', $$parms{testtext} );
        my $newhid = decode( 'utf8', $$parms{testtext_hid} );
        delete $$parms{testtext_hid};  ### THIS IS WHAT FIXES THE PROBLEM
        print "<p/> The testtext parameter as received was: ". $newtest,
              "<p/> The hidden parameter was: ". $newhid,
              $cgi->hidden( "testtext_hid", $newhid );
    }
    else {
        my $testtext = "\x{444}\x{443}\x{431}\x{430}\x{440}";
        print $cgi->textfield( -name => "testtext", -value => $testtext ),
              $cgi->hidden( "testtext_hid", $testtext ),
              $cgi->submit(-name => "submit", -value => "submit" );
    }
    print $cgi->end_form, $cgi->end_html;

I can only guess what might be going on under the covers when CGI sees that it is being given a new value (with the utf8 flag on) that replaces one of the existing parameters already in the "context" of the form (which does not have its utf8 flag on, even though it may already contain valid utf8 data -- it comes from an untrusted source, after all).

In any case, if I remove the existing parameter from the current "context", the assignment proceeds as expected -- no double encoding.

All in all, it smells like a bug in CGI, but I'm far enough behind in my coding at this point that I'm happy enough just to know that there is a way to get the intended behavior. Case closed, as far as I'm concerned.
Re^3: CGI hidden params vs. character encoding by ikegami (Patriarch) on May 28, 2008 at 04:19 UTC
> FINALLY FIGURED IT OUT! (...in a manner of speaking)

Oh, right! Here are two better ways:

    print $cgi->p("The testtext parameter as received was: ". escapeHTML($newtest)),
          $cgi->p("The hidden parameter was: ". escapeHTML($newhid)),
          $cgi->hidden(-name=>"testtext_hid", -default=>$newhid, -override=>1);

    $cgi->param('testtext_hid', $newhid);
    print $cgi->p("The testtext parameter as received was: ". escapeHTML($newtest)),
          $cgi->p("The hidden parameter was: ". escapeHTML($newhid)),
          $cgi->hidden("testtext_hid");

Note the `-override=>1` in the first snippet, and how the second snippet sets the parameter instead of setting the default.

Also note how I did the `P` elements. `<p/>` makes no sense. `<p/>text<p/>text` means `<p></p>text<p></p>text` but you want `<p>text</p><p>text</p>`.

Finally, note how I used escapeHTML to avoid an injection attack and invalid HTML generation.

> All in all, it smells like a bug in CGI

I agree. Not about having to use `-override`; that's clearly documented under the "CREATING FILL-OUT FORMS" header. But about how it handles (or rather doesn't handle) encodings other than iso-latin-1.
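
The "sticky" behavior behind `-override` can be seen in isolation with a sketch like the following. It assumes CGI.pm is installed, uses plain ASCII values (so it shows only the stickiness, not the encoding problem), and the values 'submitted' and 'replacement' are invented for the example.

```perl
use strict;
use warnings;
use CGI;

# Simulate a submitted form: the query object already holds a value for
# the field, as it would on the second request.
my $cgi = CGI->new({ testtext_hid => 'submitted' });

# Sticky behavior: the value passed here is only a default, so the
# previously submitted value is what ends up in the tag.
my $sticky = $cgi->hidden(-name => 'testtext_hid', -default => 'replacement');

# With -override, the value passed here is used instead.
my $forced = $cgi->hidden(-name => 'testtext_hid', -default => 'replacement',
                          -override => 1);

print "$sticky\n$forced\n";
```

This would also explain why swapping `$newhid` for the raw `$$parms{testtext_hid}` in the OP's update made no difference: without `-override`, the second argument is ignored whenever the parameter already has a value.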
Re^3: CGI hidden params vs. character encoding by Anonymous Monk on May 28, 2008 at 02:35 UTC
Try without Vars(), just using param()