Please consider the following little test.cgi script -- it does nothing, except to demonstrate the problem I'm having in passing a "hidden" parameter value from one form/page to the next:

#!/usr/bin/perl -T -w

use strict;
use CGI qw(:standard);
use Encode;

binmode STDOUT, ":utf8";

my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8'),
      $cgi->start_html(-title => "Testing hidden-input character encoding",
                       -encoding => 'utf8' ),
      $cgi->start_form;

my $parms = $cgi->Vars;
if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
          "<p/> The hidden parameter was: ". $newhid,
          $cgi->hidden( "testtext_hid", $newhid );
}
else {
    my $testtext = "\x{444}\x{443}\x{431}\x{430}\x{440}";
    print $cgi->textfield( -name => "testtext", -value => $testtext ),
          $cgi->hidden( "testtext_hid", $testtext ),
          $cgi->submit(-name => "submit", -value => "submit" );
}
print $cgi->end_form, $cgi->end_html;

(updated to fix indenting)

If the script is installed as "/cgi-bin/test.cgi", the first time you put that URL into your browser, here is what you get back (bear in mind that the PM "code" tags force the actual utf8 characters, which happen to be Cyrillic letters, into numeric entities; this has nothing to do with the problem):

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Testing hidden-input character encoding</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf8" />
</head>
<body>
<form method="post" action="/cgi-bin/test.cgi" enctype="multipart/form-data">
<input type="text" name="testtext" value="&#1092;&#1091;&#1073;&#1072;&#1088;" /><input type="hidden" name="testtext_hid" value="&#1092;&#1091;&#1073;&#1072;&#1088;"  /><input type="submit" name="submit" value="submit" /></form>
</body>
</html>

Now, just hit the submit button, and what comes back is all fine, except for this part:

...<input type="hidden" name="testtext_hid" value="б&#132;б&#131;аБаАб&#128;" />...

Note the difference in the value being assigned to the "hidden" parameter ("testtext_hid"). What has happened is that the string of 5 utf8-encoded Russian letters has been treated as a string of 10 single-byte characters (i.e. as non-utf8 Latin-1 data), and each byte has been re-encoded as a two-byte utf8 character.
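
To make the effect concrete, here is a minimal standalone sketch (no CGI involved) that reproduces the same double encoding. The helper name hex_bytes is only for illustration, and I am assuming the intermediate misreading is Latin-1 (ISO-8859-1); that assumption matches the byte values seen in the two outputs, but it says nothing about where in the CGI round trip the misreading happens.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of test.cgi): space-separated hex dump of a byte string.
sub hex_bytes { return join ' ', unpack( '(H2)*', $_[0] ) }

my $chars  = "\x{444}\x{443}\x{431}\x{430}\x{440}";  # 5 Cyrillic characters
my $bytes  = encode( 'UTF-8', $chars );              # correct encoding: 10 bytes
my $wrong  = decode( 'ISO-8859-1', $bytes );         # assumption: bytes misread as 10 Latin-1 characters
my $double = encode( 'UTF-8', $wrong );              # each of those re-encoded as 2 bytes: 20 bytes

print hex_bytes($bytes),  "\n";  # d1 84 d1 83 d0 b1 d0 b0 d1 80
print hex_bytes($double), "\n";  # c3 91 c2 84 c3 91 c2 83 c3 90 c2 b1 c3 90 c2 b0 c3 91 c2 80
```

The first line of output is the correct UTF-8 form of the five letters; the second is the doubly encoded form that shows up in the hidden field after submitting.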

To clarify, here is the actual byte sequence of the hidden parameter value from the script's initial output, followed by the 4-digit unicode code points that result from treating that byte sequence as utf8 characters:

d1 84  d1 83  d0 b1  d0 b0  d1 80
 0444   0443   0431   0430   0440

and here is the equivalent detail from the second output (the value being assigned via <input type="hidden"...):

c3 91  c2 84  c3 91  c2 83  c3 90  c2 b1  c3 90  c2 b0  c3 91  c2 80
 00d1   0084   00d1   0083   00d0   00b1   00d0   00b0   00d1   0080

This has me stumped. It seems to me that there is no difference in how the string value is assigned to the hidden parameter in the two blocks of the script: in both cases, a "known utf8 string" (i.e. with the utf8 flag turned on) is being assigned via the CGI module as the value of the hidden parameter.

So why does it get treated differently in the two cases -- why are there two different versions of the hidden param value -- and more importantly, how can I get CGI to behave as expected? What bonehead simple fact am I missing?

(BTW, I have tried using "is_utf8" on the incoming "$$parms{testtext_hid}" value, such that "decode()" would only be used if in fact the param value was not already flagged as a utf8 string; this was overkill, because the appearance of the "$newhid" string as part of the page content is correct when the script is run as posted. Also, the script as posted does not produce any errors or warnings.)

Update: to make things even more puzzling, I amended the troublesome code block as follows:

if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
          "<p/> The hidden parameter has been decoded as: ". $newhid,
          $cgi->hidden( "testtext_hid", $$parms{testtext_hid} );
    # note: changed text content slightly, and used "raw" hidden value (not "decoded" value)
}

The minor change in the text showed up (so I know I was running the intended version of the script), but the value assigned to the hidden parameter did not change -- it was still screwed up the same way as before! Now I'm really lost. (I should add that this is Perl 5.8.8 built for i386-freebsd-64int, with CGI v3.17.)

In reply to CGI hidden params vs. character encoding by graff
