Encoding confusion with CGI forms

by davistv (Acolyte)
on Oct 21, 2004 at 21:54 UTC ( [id://401315] )

davistv has asked for the wisdom of the Perl Monks concerning the following question:

I've been struggling to convert text pasted into a web form into HTML with sensible escape codes. The problem is that I can't count on the pasted text to be in any particular encoding: sometimes it comes from a web page in UTF-8, sometimes it's ISO-8859-1, and sometimes it's Windows-1252. I've done lots of Google searches, tried Text::Iconv (I get no output from it for some odd reason), and I've also tried Unicode::String (lossy conversion) and Unicode::Map8, all to no avail.

I thought that the browser did an encoding conversion based on the content-type header, but that doesn't seem to be the case. Is there any practical way to translate text submitted via a web form into HTML?

The server is Fedora 2, LANG is en_US.UTF-8. Clients are mostly Windows plus a few Mac OS X systems.

Thank You,
Troy

Replies are listed 'Best First'.
Re: Encoding confusion with CGI forms
by borisz (Canon) on Oct 21, 2004 at 22:09 UTC
    The clients send the data in some encoding, and you need to know which one. The trick is to add a hidden field to the form; its value gets converted to whatever encoding the client uses, just like the rest of the form data. Since you know what the hidden value should look like, you can tell the encoding of the form data. Just convert everything into a common format and you are done.
    Boris
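
    A minimal sketch of that detection step, reusing the "\x{df}\x{20ac}" probe that appears in the full example further down the thread (the surrounding code is an assumption, not Boris's exact words):

    use strict;
    use Encode qw(encode);

    # byte signatures the probe produces under the encodings we care about
    my $probe = "\x{df}\x{20ac}";    # U+00DF (sharp s) followed by the euro sign
    my %charset_for = (
        encode( 'UTF-8',  $probe ) => 'utf8',      # "\xc3\x9f\xe2\x82\xac"
        encode( 'cp1252', $probe ) => 'cp1252',    # "\xdf\x80"
    );

    # $raw_hidden is the hidden field exactly as the octets arrived
    sub sniff_charset {
        my ($raw_hidden) = @_;
        # anything that matches neither signature falls back to iso-8859-1
        return $charset_for{$raw_hidden} || 'iso-8859-1';
    }

    print sniff_charset("\xc3\x9f\xe2\x82\xac"), "\n";   # utf8
    print sniff_charset("\xdf\x80"), "\n";               # cp1252

    Boris's example below does the same comparison inline with explicit byte strings.
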
      Sounds like a nice trick, especially since the official ways of handling request character encoding are neither well supported nor very well thought out (as far as I know, a user agent is only required to send charset information for multipart forms).

      I'm just wondering what string to use, though. There are dozens of encodings in use around the world, and ideally you should be able to recognize each one. Is there any "standard" way of doing this? (A CPAN module would be wonderful, of course.)
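
      For what it's worth, Encode::Guess (shipped with the core Encode distribution) attempts automatic detection, but it cannot distinguish single-byte charsets that accept the same bytes, such as ISO-8859-1 and Windows-1252, which is exactly why a known probe string is attractive. A minimal sketch of its documented usage:

      use strict;
      use Encode::Guess;

      my $octets = "caf\xe9";                             # "café" as Latin-1 bytes
      my $guess  = guess_encoding( $octets, 'latin1' );   # suspects besides ascii/utf8
      if ( ref $guess ) {
          print "looks like ", $guess->name, "\n";        # canonical name (iso-8859-1 here)
          my $text = $guess->decode($octets);
      }
      else {
          warn "could not guess: $guess\n";               # error string when ambiguous
      }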

        I live in Germany, so my string is only '' to distinguish between ISO-8859-1, UTF-8 and unknown. But you can extend this to all your supported encodings; just find a char with different representations in each.
        Boris
        I really like this approach as well; I'm just having trouble coming up with a string that degrades in some predictable manner under different encodings.

        A perl module for this would be awesome, btw! It would be even more automagical if it were integrated into CGI.pm behind the scenes!

        Cheers,
        Troy

      Hi Boris,

      The hidden form variable sounds like a great strategy, thank you for the tip. I've been playing with this concept today, but I'm not getting useful results so far.

      If I use the Unicode smiley character (\x{263a}) as my hidden form field value, I get back the UTF-8 byte sequence \xe2\x98\xba, which shows up as mojibake like "â˜º" when treated as Latin-1/Windows-1252. I can detect that with the regex /^\xe2\x98\xba$/. But even when the default encoding for Apache is UTF-8, the content-type charset in the resulting page is utf-8, the script itself is saved as UTF-8 (with no BOM, or Perl gives an error like "(8)Exec format error: exec of '/var/www/cgi-bin/char2.cgi' failed"), and my browser sets itself to UTF-8 encoding as it should, text that I copy and paste from a document known to be UTF-8 is detected as Latin-1 and not UTF-8.

      So can I ask what string you use as a detection mechanism? And what are you using to match the mis-converted string in other encodings? I'm interested in Win1252 and Latin 1, if that makes any difference. My current source is below.

      Thank You,
      Troy
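
      One way to see exactly which bytes arrive, instead of guessing at regexes, is to dump the hidden field in hex. A throwaway sketch, assuming the field is named enc_sniffer as in the example below:

      #!/usr/bin/perl
      # throwaway script: dump the raw octets of the hidden probe field in hex,
      # so the byte signature each client encoding produces can be read off directly
      use strict;
      use CGI;

      my $c = CGI->new;
      print $c->header( -type => 'text/plain' );
      my $raw = $c->param('enc_sniffer');
      $raw = '' unless defined $raw;
      printf "enc_sniffer: %s\n", unpack( 'H*', $raw );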

        Hi Troy,
        here is an untested example. Copy it to your cgi-bin/troy.pl. Without parameters it presents the submit form; otherwise it shows the submitted data, always in UTF-8.
        The error in your script is that you confuse UTF-8 bytes and Unicode characters somewhere. Also, the meta tags look unnecessary to me.
        PS: Consider adding readmore tags around your code on PM.
        #!/usr/bin/perl
        use strict;
        use Encode;
        use CGI;

        $|++;
        my $c = CGI->new;
        binmode( STDOUT, ":utf8" );

        unless ( () = $c->param ) {
            # no parameters yet: show the form, including the hidden probe field
            print $c->header( -charset => 'utf-8' ),
              $c->start_html('Character conversion test'),
              qq{
        <form action="/cgi-bin/troy.pl" method="post" enctype="multipart/form-data">
        <input type="hidden" name="enc_sniffer" value="\x{df}\x{20ac}">
        <textarea name="textInput" rows="25" cols="72"></textarea><p>
        <input type="submit">
        </form>},
              $c->end_html;
            exit;
        }
        else {
            my $enc;
            my $hidden = $c->param('enc_sniffer');
            {
                # compare the raw bytes of the probe against known signatures
                use bytes;
                $enc =
                    ( $hidden eq "\xc3\x9f\xe2\x82\xac" ) ? 'utf8'
                  : ( $hidden eq "\xdf\x80" )             ? 'cp1252'
                  :                                         'iso-8859-1';
            }
            # decode all param fields here
            my $qtext = decode( $enc, $c->param('textInput') );
            print $c->header( -charset => 'utf-8' ),
              $c->start_html("Encoding is $enc"),
              $enc, " ", $qtext,
              $c->end_html;
        }
        Boris
Re: Encoding confusion with CGI forms
by iburrell (Chaplain) on Oct 22, 2004 at 19:04 UTC
    What charset are you setting in the Content-Type header? Browsers should use that charset for form submissions. In my experience, UTF-8 works well, as does windows-1252, Windows' extension of iso-8859-1.

    There is an accept-charset attribute on the form tag but it is not well supported. Also, with POSTs browsers should include a charset in the HTTP request header but that is also not well supported.

    This looks like a good primer: http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
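
    A minimal sketch of that advice in CGI.pm terms: declare the charset in the real HTTP Content-Type header (not only in a meta tag) and, for whatever little good it does, hint it on the form with accept-charset as well:

    #!/usr/bin/perl
    use strict;
    use CGI;

    my $c = CGI->new;
    # emits the real HTTP header: "Content-Type: text/html; charset=utf-8"
    print $c->header( -charset => 'utf-8' ),
      $c->start_html('charset test'),
      qq{<form method="post" action="/cgi-bin/troy.pl" accept-charset="UTF-8">
    <textarea name="textInput" rows="10" cols="60"></textarea>
    <input type="submit">
    </form>},
      $c->end_html;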

      My content-type header looks like:
      <meta http-equiv="content-type" content="text/html; charset=utf-8">

      The page, script, and Apache configuration are all set to UTF-8 right now, but when I paste text from MS Word into the form, for instance, it strips out all of the high-bit characters. I just posted the source for the script to this thread; please take a look.

      And I'll definitely read that page on i18n form usage; thank you for the link. I'm not sure if it will fix the problem I'm dealing with or not, though. The biggest problem is that I can't count on the page encoding dictating the string encoding I get back, so I'm trying to detect it with a hidden form variable instead. The accept-charset attribute just doesn't seem to work on my clients' machines.

      I might have to try the windows-1252 charset option you mention. I'm not sure that will fix all of the broken behavior either, though. I've still got to escape the text to Latin-1 with HTML entities before inserting it into MySQL (which has no Unicode support).

      Thank You,
      Troy
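
      A rough sketch of that last step, assuming HTML::Entities is available: decode the submitted octets with whatever charset was detected, then entity-encode so that only ASCII reaches the Latin-1 MySQL column.

      use strict;
      use Encode qw(decode);
      use HTML::Entities qw(encode_entities);

      # $charset comes from the hidden-field detection, e.g. 'utf8' or 'cp1252'
      sub form_text_to_entities {
          my ( $octets, $charset ) = @_;
          my $text = decode( $charset, $octets );
          # the default unsafe set of encode_entities covers <, >, &, quotes,
          # control characters and everything non-ASCII, so the result is plain
          # ASCII with named or numeric entities
          return encode_entities($text);
      }

      # UTF-8 bytes for "café €" come back as pure-ASCII entities
      print form_text_to_entities( "caf\xc3\xa9 \xe2\x82\xac", 'utf8' ), "\n";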
