Avox has asked for the wisdom of the Perl Monks concerning the following question:

I've got what some might consider an interesting problem.

I have a Russian version of Windows NT. On it I have a little web server that runs Perl CGI scripts (using the Cygwin build of Perl v5.8.0).

My HTML form page (generated by Perl) is encoded in UTF-8. In the submission script, I run a small Win32 command-line application (compiled with Unicode support enabled) which sets a registry key to the parameter I pass to it.

Now, I can run the application from the command line and it works correctly (although the Russian you enter at the DOS prompt looks like garbage, it is the correct string when it hits the registry).

If I enter numeric character references (e.g. &#1071; for Я) into the form, this seems to work just fine. I'd rather not have my users be forced to enter this type of input.

If my HTML page is encoded as ISO-8859-1, then my Russian looks like garbage, but when I submit Russian text through the form, it is written to the registry as numeric character references (i.e. &#1071; for Я). Then I can change my browser's encoding to UTF-8 and all is well.

So it looks to me like the HTML page encoding matters. Is there some way I can convert my form data into numeric character references (these seem to work fine with everything)? Or is there some other encoding that will play better with my Win32 app? Are there any other solutions that would be better?

Thanks in advance for any help.

Replies are listed 'Best First'.
Re: HTML Forms, utf-8, windows and perl
by graff (Chancellor) on Dec 10, 2004 at 03:51 UTC
    The missing piece in the puzzle is what sort of character encoding your WinNT-native command line app is using for Russian. It could be UTF-16LE (2-byte-characters), or it could be CP1251 (single-byte characters). It wouldn't make any sense to consider 8859-1 (Latin1), which is Western European, not Cyrillic; and I doubt that 8859-5 would come into play (the ISO single-byte table for Cyrillic), because this bears little or no relation to Microsoft's native CP1251.

    In any case, the problem is that the command-line app is using one encoding, and the cgi script (and browsers) are using a different encoding. This in itself is not a show-stopper: Perl 5.8.0's Encode module (and/or the PerlIO layer) can easily handle the conversion between different encodings. It's just that you need to know which other encoding you're dealing with (besides utf8).
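
    A minimal sketch of that conversion, assuming the form delivers UTF-8 bytes and the native app turns out to want CP1251 (the helper name utf8_to_native is made up for illustration; decode and encode are the real Encode functions):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 bytes from the form into the
# byte encoding the native application expects.
sub utf8_to_native {
    my ($utf8_bytes, $native) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode($native, $chars);             # characters -> native bytes
}

# Cyrillic capital "Ya" (U+042F) is the two bytes D0 AF in UTF-8.
my $native_bytes = utf8_to_native("\xD0\xAF", 'cp1251');
print unpack('H*', $native_bytes), "\n";        # hex of the converted bytes
```

    If the app expects UTF-16LE instead, passing 'UTF-16LE' as the second argument should be the only change needed.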

    Actually, there's another mystery (for me, at least, since I don't have a Russian keyboard): what character codes are being received by the cgi script from the web client when the form is submitted? When someone loads the form into their browser, goes to a type-in box, and hits the upper-left-most letter (next to <tab>, below "1" and "2", the key that is "Q" on standard English keyboards), what code point (what actual value) is emitted to the form, and which character encoding is it based on? (And likewise for the other letters.)
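
    One way to answer that is to log the raw bytes the script receives. A rough sketch (hex_dump is a hypothetical helper; the sample byte values assume the user typed a single capital "Ya", U+042F):

```perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a string as two hex digits,
# so the submitted form value can be inspected in a log file.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# What a single capital "Ya" would look like in each candidate encoding:
print hex_dump("\xD0\xAF"), "\n";   # UTF-8 form of U+042F
print hex_dump("\xDF"),     "\n";   # CP1251 form
print hex_dump("\x2F\x04"), "\n";   # UTF-16LE form
```

    Comparing the logged bytes against these patterns should show which encoding the browser actually sent.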

    If you can answer that last question (and the answer is not utf8) then you can add some simple steps to the cgi code that will convert from that encoding into utf8 (for use internally in the cgi script), into numeric html entities (for storing to the registry and transmitting back to the client browser), and even into the WinNT native encoding (whichever one that might be).

    The perldoc man page for the Encode module can help a lot with getting characters from CP1251 (or UTF16) into utf8 and back, and once you have a string stored in a scalar variable as utf8 text, you can use a regex like the following to convert that into numeric html entities:

    s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg
    This takes each non-ASCII utf8 character in $_, and formats it as a numeric html entity. (See the "perlre" man page in 5.8.0, under "POSIX character class syntax", for more info on the ":ascii:" expression.)
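
    Put together as a sketch (to_entities is a hypothetical name): the string has to be decoded into Perl characters first, otherwise the regex would see each raw UTF-8 byte as a separate non-ASCII character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 bytes into ASCII text in which every
# non-ASCII character is written as a numeric html entity.
sub to_entities {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);    # bytes -> characters
    $text =~ s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg;
    return $text;
}

print to_entities("abc \xD0\xAF"), "\n";        # prints "abc &#1071;"
```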
      Thanks for your help! You pointed me in the right direction! I had been messing around with Encode while investigating all this, but didn't try CP1251! With 1251 it all worked!

      Excellent, you might think, right? Well, in my original question I tried to simplify my situation by limiting myself to Russian. I didn't mention the other 15 languages I had to support (on their native NT installations).

      However, knowing I could do this with 1251, I looked in my C++ application's code pages for the encoding each language was using. I then listed all the encodings that Encode supports. Lo and behold, all of them existed in there. So I created a function that polls the OS to see what language we are using, then encodes from utf-8 into the appropriate encoding for that language!
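
      Roughly along these lines (a sketch only: the language keys, the table contents, and the function name are illustrative assumptions, and the step that polls the OS for the current language is left out):

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Illustrative table: language -> Windows code page name known to Encode.
# The real table would cover every supported language.
my %ENCODING_FOR = (
    ru => 'cp1251',   # Cyrillic
    pl => 'cp1250',   # Central European
    el => 'cp1253',   # Greek
    tr => 'cp1254',   # Turkish
    ja => 'cp932',    # Japanese (Shift-JIS variant)
);

# Hypothetical helper: encode UTF-8 form data for the given language.
sub encode_for_language {
    my ($lang, $utf8_bytes) = @_;
    my $name = $ENCODING_FOR{$lang}
        or die "No encoding known for language '$lang'\n";
    my $enc = find_encoding($name)
        or die "Encode does not support '$name'\n";
    return $enc->encode( decode('UTF-8', $utf8_bytes) );
}

my $registry_value = encode_for_language('ru', "\xD0\xAF");
```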

      Perl is the glue that holds my life together... although sometimes I wish it wasn't quite so sticky! (Actually, it was Perl that made my solution possible; I wish internationalization wasn't so sticky! :D)

      Thanks again!
Re: HTML Forms, utf-8, windows and perl
by dragonchild (Archbishop) on Dec 10, 2004 at 13:47 UTC
    Just curious, but
    • is the browser you're using IE6
    • do you have your browser's encoding set to Auto-Select
    • are you specifying the encoding within a <META> tag

    If you are, please let me know - I'm currently tracking down a rather weird bug where IE6's Auto-Select doesn't always work and it's screwing up some UTF-8 input/display stuff.

    Update: For those who care, it turns out that for popup windows, under certain circumstances, IE will screw up the Auto-Select if you don't have a <title> tag. Go figure.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      I'm using IE 5.5 (Can't upgrade...*sigh*)
      I'm not using auto-select
      I specify utf8 in the meta tag.