graff has asked for the wisdom of the Perl Monks concerning the following question:
(updated to fix indenting)

```perl
#!/usr/bin/perl -T -w
use strict;
use CGI qw(:standard);
use Encode;

binmode STDOUT, ":utf8";

my $cgi = CGI->new();
print $cgi->header(-charset => 'utf-8'),
      $cgi->start_html(-title    => "Testing hidden-input character encoding",
                       -encoding => 'utf8' ),
      $cgi->start_form;

my $parms = $cgi->Vars;

if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid  = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
          "<p/> The hidden parameter was: ". $newhid,
          $cgi->hidden( "testtext_hid", $newhid );
}
else {
    my $testtext = "\x{444}\x{443}\x{431}\x{430}\x{440}";
    print $cgi->textfield( -name => "testtext", -value => $testtext ),
          $cgi->hidden( "testtext_hid", $testtext ),
          $cgi->submit(-name => "submit", -value => "submit" );
}
print $cgi->end_form, $cgi->end_html;
```
If the script is installed as "/cgi-bin/test.cgi", the first time you put that url into your browser, here is what you get back (bear in mind that the PM "code" tags are forcing the actual utf8 characters, which happen to be Cyrillic letters, into numeric entities, and this has nothing to do with the problem):

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Testing hidden-input character encoding</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf8" />
</head>
<body>
<form method="post" action="/cgi-bin/test.cgi" enctype="multipart/form-data">
<input type="text" name="testtext" value="&#1092;&#1091;&#1073;&#1072;&#1088;" /><input type="hidden" name="testtext_hid" value="&#1092;&#1091;&#1073;&#1072;&#1088;" /><input type="submit" name="submit" value="submit" /></form>
</body>
</html>
```

Now, just hit the submit button, and what comes back is all fine, except for this part:

```html
...<input type="hidden" name="testtext_hid" value="Ñ„ÑƒÐ±Ð°Ñ€" />...
```

Note the difference in the value being assigned to the "hidden" parameter ("testtext_hid"). What has happened is that the string of 5 utf8-encoded Russian letters has been treated as a string of 10 single-byte characters (i.e. as non-utf8 Latin-1 data), and each byte has been re-encoded as a two-byte utf8 character.
To clarify, here is the actual byte sequence of the hidden parameter value from the script's initial output, followed by the 4-digit unicode code points that result from treating that byte sequence as utf8 characters:

```text
d1 84 d1 83 d0 b1 d0 b0 d1 80
0444 0443 0431 0430 0440
```

and here is the equivalent detail from the second output (the value being assigned via <input type="hidden"...):

```text
c3 91 c2 84 c3 91 c2 83 c3 90 c2 b1 c3 90 c2 b0 c3 91 c2 80
00d1 0084 00d1 0083 00d0 00b1 00d0 00b0 00d1 0080
```
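The two byte sequences can be reproduced outside of CGI. The following is a minimal sketch using only the core Encode module; `hex_bytes` is a hypothetical helper introduced here for illustration and is not part of the script above. It encodes the five Cyrillic characters once, then encodes the resulting byte string a second time, which treats each byte as a Latin-1 character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not in the original script):
# space-separated hex dump of every byte in a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Five Cyrillic characters, held as a Perl character string.
my $chars = "\x{444}\x{443}\x{431}\x{430}\x{440}";

# Encoding once gives the ten bytes of correct UTF-8.
my $once = encode( 'UTF-8', $chars );

# $once is a plain byte string, so encoding it again treats each
# byte as a character in the range 0x80..0xFF and emits two bytes
# for every one of them.
my $twice = encode( 'UTF-8', $once );

print hex_bytes($once),  "\n";   # d1 84 d1 83 d0 b1 d0 b0 d1 80
print hex_bytes($twice), "\n";   # c3 91 c2 84 c3 91 c2 83 c3 90 c2 b1 c3 90 c2 b0 c3 91 c2 80
```

If the second line matches the bytes seen in the hidden field, that suggests the value reached the `:utf8` output layer as undecoded bytes rather than as a character string.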
This has me stumped. It seems to me that there is no difference in how the string value is assigned to the hidden parameter in the two blocks of the script: in both cases, a "known utf8 string" (i.e. with the utf8 flag turned on) is being assigned via the CGI module as the value of the hidden parameter.
So why does it get treated differently in the two cases -- why are there two different versions of the hidden param value -- and more importantly, how can I get CGI to behave as expected? What bonehead simple fact am I missing?
(BTW, I have tried using "is_utf8" on the incoming "$$parms{testtext_hid}" value, such that "decode()" would only be used if in fact the param value was not already flagged as a utf8 string; this was overkill, because the appearance of the "$newhid" string as part of the page content is correct when the script is run as posted. Also, the script as posted does not produce any errors or warnings.)
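For reference, the "decode only if not already flagged" check described above could look roughly like the sketch below. The helper name `decode_unless_flagged` is hypothetical, and the sketch assumes that an unflagged value holds raw UTF-8 bytes. Note that the flag only says how Perl stores the string internally; a flagged string is not guaranteed to have been decoded correctly.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not in the original script): return a
# character string whether the input is raw UTF-8 bytes or a
# string that already carries Perl's internal utf8 flag.
sub decode_unless_flagged {
    my ($value) = @_;

    # Already flagged: assume it was decoded earlier and pass it through.
    return $value if utf8::is_utf8($value);

    # Not flagged: assume raw UTF-8 bytes and decode them.
    return decode( 'UTF-8', $value );
}

my $from_bytes = decode_unless_flagged("\xd1\x84\xd1\x83");   # raw bytes in
my $from_chars = decode_unless_flagged("\x{444}\x{443}");     # characters in

print length($from_bytes), " ", length($from_chars), "\n";
```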
Update: to make things even more puzzling, I amended the troublesome code block as follows:

```perl
if ( $$parms{submit} ) {
    my $newtest = decode( 'utf8', $$parms{testtext} );
    my $newhid  = decode( 'utf8', $$parms{testtext_hid} );
    print "<p/> The testtext parameter as received was: ". $newtest,
          "<p/> The hidden parameter has been decoded as: ". $newhid,
          $cgi->hidden( "testtext_hid", $$parms{testtext_hid} );
    # note: changed text content slightly, and used "raw" hidden value (not "decoded" value)
}
```

The minor change in the text showed up (so I know I was running the intended version of the script), but the value assigned to the hidden parameter did not change -- it was still screwed up the same way as before! Now I'm really lost. (I should add that this is Perl 5.8.8 built for i386-freebsd-64int, with CGI v3.17.)
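One thing that may be worth testing here, offered as an assumption rather than something established in this post: CGI.pm documents its form fields, hidden fields included, as "sticky", meaning that a value already present in the submitted request takes precedence over the value passed to the helper unless `-override` (or `-force`) is given. That would be consistent with the hidden field ignoring whatever is passed to `hidden()` on the second request. The sketch below assumes the CGI module is installed and simulates the second request by constructing the object from a hash of raw bytes instead of a real form submission:

```perl
use strict;
use warnings;
use CGI;
use Encode qw(decode);

# Raw UTF-8 bytes, as a browser would submit them for the five Cyrillic letters.
my $raw = "\xd1\x84\xd1\x83\xd0\xb1\xd0\xb0\xd1\x80";

# Simulate the second request: the hidden parameter already has a submitted value.
my $q = CGI->new( { testtext_hid => $raw, submit => 'submit' } );

# Decode the submitted bytes into a character string.
my $decoded = decode( 'UTF-8', scalar $q->param('testtext_hid') );

# Sticky behaviour: the submitted (raw) value is used; the -default is ignored.
my $sticky = $q->hidden( -name => 'testtext_hid', -default => $decoded );

# With -override, the value supplied by the script is used instead.
my $forced = $q->hidden(
    -name     => 'testtext_hid',
    -default  => $decoded,
    -override => 1,
);

print "sticky field uses the raw bytes\n"      if index( $sticky, $raw ) >= 0;
print "forced field uses the decoded string\n" if index( $forced, $decoded ) >= 0;
```

The documentation also describes resetting the stored value with `param('testtext_hid', $decoded)` before calling `hidden()`. In this sketch `-override` is passed in the named-argument form; the positional form `hidden("testtext_hid", $value)` used in the script has no slot for it.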