Hi Boris,

The hidden form variable sounds like a great strategy; thank you for the tip. I've been playing with the concept today, but I'm not getting useful results so far.

If I use the Unicode smiley character (\x{263a}) as my hidden form field value, I get back a character sequence like "☺", or \xe2\x98\xba when read as Latin-1. I can detect that with the regex /^\xe2\x98\xba$/. The problem is that the input is detected as Latin-1 rather than UTF-8 even when everything is set to UTF-8: the default encoding for Apache is UTF-8, the content-type charset of the resulting page is utf-8, the script itself is saved as UTF-8 (without a BOM, because with one I get an error like "(8)Exec format error: exec of '/var/www/cgi-bin/char2.cgi' failed"), my browser sets itself to UTF-8 as it should, and the text I paste in comes from a document known to be UTF-8.
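
To make the symptom concrete, here is a minimal standalone sketch of the two matches outside of CGI. It assumes the form value reaches the script as undecoded bytes, which I have not confirmed is what CGI.pm hands back; the variable names are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# The sniffer character, U+263A, as a Perl character string.
my $smiley = "\x{263a}";

# Assumption: a browser submitting in UTF-8 sends these three bytes.
my $bytes = encode('UTF-8', $smiley);
print join(' ', map { sprintf('\\x%02x', ord($_)) } split(//, $bytes)), "\n";
# prints: \xe2 \x98 \xba

# Left undecoded, the value is a three-character byte string, so the
# byte-level regex matches and the character-level regex does not.
my $raw_matches_bytes = ($bytes =~ /^\xe2\x98\xba$/) ? 1 : 0;
my $raw_matches_char  = ($bytes =~ /^\x{263a}$/)     ? 1 : 0;

# Once decoded from UTF-8, the value is a single character again and
# the character-level regex matches.
my $decoded              = decode('UTF-8', $bytes);
my $decoded_matches_char = ($decoded =~ /^\x{263a}$/) ? 1 : 0;

print "raw vs byte regex:     $raw_matches_bytes\n";
print "raw vs char regex:     $raw_matches_char\n";
print "decoded vs char regex: $decoded_matches_char\n";
```

If that assumption holds, a correct UTF-8 submission would always land in my "Latin1" branch, because both cases look identical at the byte level until something decodes the value.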

So can I ask what string you use as a detection mechanism? And what are you using to match the mis-converted string in other encodings? I'm interested in Win1252 and Latin 1, if that makes any difference. My current source is below.

Thank you,
Troy

#!/usr/bin/perl
use utf8;
use strict;
use Unicode::String qw(utf8 latin1 utf16);
use Encode;
use CGI;
use HTML::Entities;
require Unicode::Map8;

my $smiley  = "\x{263a}";
my $l1_map  = Unicode::Map8->new("latin1") || die;
my $win_map = Unicode::Map8->new("cp1252") || die;
my $cgiq    = CGI->new;
my $qtext   = $cgiq->param('textInput');

binmode(STDOUT, ":utf8");
print $cgiq->header(-charset=>'utf-8');
print '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Character conversion test</title>
</head>
<body bgcolor="#ffffff">
';

# Guess the submitted encoding from what came back in the hidden field.
my $encoded = '';
if ($cgiq->param('enc_sniffer') =~ /^\x{263a}$/) {
    print "<p>Unicode encoding detected.</p>\n";
    my $u = utf8($qtext);
    my $converted = $u->latin1;
    $encoded = encode_entities($converted);
} elsif ($cgiq->param('enc_sniffer') =~ /^\xe2\x98\xba$/ ) {
    print "<p>Latin1 encoding detected.</p>\n";
    my $u = utf8($qtext);
    my $converted = $u->latin1;
    $encoded = encode_entities($converted);
} elsif ($cgiq->param('enc_sniffer') =~ /not sure what to put here/ ) {
    print "<p>Windows 1252 encoding detected.</p>\n";
    $encoded = TransWin1252($qtext);
}

print ' enc_sniffer: ' . $smiley;
print "\n<p></p>\n";
print ' Text submitted:<br>' . $qtext . ' <p></p>';
print ' Encoded:<br>' . $encoded . ' <p></p>
<form action="/cgi-bin/char2.cgi" method="post" enctype="multipart/form-data">
<input type="hidden" name="enc_sniffer" value="' . $smiley . '">
<textarea name="textInput" rows="25" cols="72"></textarea><p>
<input type="submit">
</form>
</p>
<p></p>
</body>
</html>';
exit;

# Replace each Windows-1252 byte in the range 0x80-0xFF with an HTML entity.
# Bytes that are undefined in Windows-1252 become a space.
sub TransWin1252 {
    my $s = $_[0];
    $s =~ s/\x80/&euro;/g;
    $s =~ s/\x81/ /g;
    $s =~ s/\x82/&sbquo;/g;
    $s =~ s/\x83/&fnof;/g;
    $s =~ s/\x84/&bdquo;/g;
    $s =~ s/\x85/&hellip;/g;
    $s =~ s/\x86/&dagger;/g;
    $s =~ s/\x87/&Dagger;/g;
    $s =~ s/\x88/&circ;/g;
    $s =~ s/\x89/&permil;/g;
    $s =~ s/\x8A/&Scaron;/g;
    $s =~ s/\x8B/&lsaquo;/g;
    $s =~ s/\x8C/&OElig;/g;
    $s =~ s/\x8D/ /g;
    $s =~ s/\x8E/&Zcaron;/g;
    $s =~ s/\x8F/ /g;
    $s =~ s/\x90/ /g;
    $s =~ s/\x91/&lsquo;/g;
    $s =~ s/\x92/&rsquo;/g;
    $s =~ s/\x93/&ldquo;/g;
    $s =~ s/\x94/&rdquo;/g;
    $s =~ s/\x95/&bull;/g;
    $s =~ s/\x96/&ndash;/g;
    $s =~ s/\x97/&mdash;/g;
    $s =~ s/\x98/&tilde;/g;
    $s =~ s/\x99/&trade;/g;
    $s =~ s/\x9A/&scaron;/g;
    $s =~ s/\x9B/&rsaquo;/g;
    $s =~ s/\x9C/&oelig;/g;
    $s =~ s/\x9D/ /g;
    $s =~ s/\x9E/&zcaron;/g;
    $s =~ s/\x9F/&Yuml;/g;
    $s =~ s/\xA0/&nbsp;/g;
    $s =~ s/\xA1/&iexcl;/g;
    $s =~ s/\xA2/&cent;/g;
    $s =~ s/\xA3/&pound;/g;
    $s =~ s/\xA4/&curren;/g;
    $s =~ s/\xA5/&yen;/g;
    $s =~ s/\xA6/&brvbar;/g;
    $s =~ s/\xA7/&sect;/g;
    $s =~ s/\xA8/&uml;/g;
    $s =~ s/\xA9/&copy;/g;
    $s =~ s/\xAA/&ordf;/g;
    $s =~ s/\xAB/&laquo;/g;
    $s =~ s/\xAC/&not;/g;
    $s =~ s/\xAD/&shy;/g;
    $s =~ s/\xAE/&reg;/g;
    $s =~ s/\xAF/&macr;/g;
    $s =~ s/\xB0/&deg;/g;
    $s =~ s/\xB1/&plusmn;/g;
    $s =~ s/\xB2/&sup2;/g;
    $s =~ s/\xB3/&sup3;/g;      # was \x83, which can never match after the &fnof; line
    $s =~ s/\xB4/&acute;/g;
    $s =~ s/\xB5/&micro;/g;
    $s =~ s/\xB6/&para;/g;
    $s =~ s/\xB7/&middot;/g;
    $s =~ s/\xB8/&cedil;/g;
    $s =~ s/\xB9/&sup1;/g;
    $s =~ s/\xBA/&ordm;/g;
    $s =~ s/\xBB/&raquo;/g;
    $s =~ s/\xBC/&frac14;/g;
    $s =~ s/\xBD/&frac12;/g;
    $s =~ s/\xBE/&frac34;/g;
    $s =~ s/\xBF/&iquest;/g;
    $s =~ s/\xC0/&Agrave;/g;
    $s =~ s/\xC1/&Aacute;/g;
    $s =~ s/\xC2/&Acirc;/g;
    $s =~ s/\xC3/&Atilde;/g;    # was \x83
    $s =~ s/\xC4/&Auml;/g;
    $s =~ s/\xC5/&Aring;/g;
    $s =~ s/\xC6/&AElig;/g;
    $s =~ s/\xC7/&Ccedil;/g;
    $s =~ s/\xC8/&Egrave;/g;
    $s =~ s/\xC9/&Eacute;/g;
    $s =~ s/\xCA/&Ecirc;/g;
    $s =~ s/\xCB/&Euml;/g;
    $s =~ s/\xCC/&Igrave;/g;
    $s =~ s/\xCD/&Iacute;/g;
    $s =~ s/\xCE/&Icirc;/g;
    $s =~ s/\xCF/&Iuml;/g;
    $s =~ s/\xD0/&ETH;/g;
    $s =~ s/\xD1/&Ntilde;/g;
    $s =~ s/\xD2/&Ograve;/g;
    $s =~ s/\xD3/&Oacute;/g;    # was \x83
    $s =~ s/\xD4/&Ocirc;/g;
    $s =~ s/\xD5/&Otilde;/g;
    $s =~ s/\xD6/&Ouml;/g;
    $s =~ s/\xD7/&times;/g;
    $s =~ s/\xD8/&Oslash;/g;
    $s =~ s/\xD9/&Ugrave;/g;
    $s =~ s/\xDA/&Uacute;/g;
    $s =~ s/\xDB/&Ucirc;/g;
    $s =~ s/\xDC/&Uuml;/g;
    $s =~ s/\xDD/&Yacute;/g;
    $s =~ s/\xDE/&THORN;/g;
    $s =~ s/\xDF/&szlig;/g;
    $s =~ s/\xE0/&agrave;/g;
    $s =~ s/\xE1/&aacute;/g;
    $s =~ s/\xE2/&acirc;/g;
    $s =~ s/\xE3/&atilde;/g;    # was \x83
    $s =~ s/\xE4/&auml;/g;
    $s =~ s/\xE5/&aring;/g;
    $s =~ s/\xE6/&aelig;/g;
    $s =~ s/\xE7/&ccedil;/g;
    $s =~ s/\xE8/&egrave;/g;
    $s =~ s/\xE9/&eacute;/g;
    $s =~ s/\xEA/&ecirc;/g;
    $s =~ s/\xEB/&euml;/g;
    $s =~ s/\xEC/&igrave;/g;
    $s =~ s/\xED/&iacute;/g;
    $s =~ s/\xEE/&icirc;/g;
    $s =~ s/\xEF/&iuml;/g;
    $s =~ s/\xF0/&eth;/g;
    $s =~ s/\xF1/&ntilde;/g;
    $s =~ s/\xF2/&ograve;/g;
    $s =~ s/\xF3/&oacute;/g;    # was \x83
    $s =~ s/\xF4/&ocirc;/g;
    $s =~ s/\xF5/&otilde;/g;
    $s =~ s/\xF6/&ouml;/g;
    $s =~ s/\xF7/&divide;/g;
    $s =~ s/\xF8/&oslash;/g;
    $s =~ s/\xF9/&ugrave;/g;
    $s =~ s/\xFA/&uacute;/g;
    $s =~ s/\xFB/&ucirc;/g;
    $s =~ s/\xFC/&uuml;/g;
    $s =~ s/\xFD/&yacute;/g;
    $s =~ s/\xFE/&thorn;/g;
    $s =~ s/\xFF/&yuml;/g;
    return($s);
}

In reply to Re^2: Encoding confusion with CGI forms by davistv
in thread Encoding confusion with CGI forms by davistv
