Re: HTML Forms, utf-8, windows and perl

The missing piece in the puzzle is what sort of character encoding your WinNT-native command line app is using for Russian. It could be UTF-16LE (two bytes per character), or it could be CP1251 (one byte per character). It wouldn't make any sense to consider 8859-1 (Latin1), which is Western European, not Cyrillic; and I doubt that 8859-5 (the ISO single-byte table for Cyrillic) would come into play, because its layout bears little or no relation to Microsoft's native CP1251.

In any case, the problem is that the command-line app is using one encoding, and the cgi script (and browsers) are using a different encoding. This in itself is not a show-stopper: Perl 5.8.0's Encode module (and/or the PerlIO layer) can easily handle the conversion between different encodings. It's just that you need to know which other encoding you're dealing with (besides utf8).
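To make that concrete, here is a minimal sketch of both routes, assuming the other encoding turns out to be CP1251. The sample bytes (the Russian word for "hello") and the variable names are just for illustration; swap in whatever encoding your app really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the six CP1251 bytes of the Russian word for "hello".
my $cp1251_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

# decode() turns encoded bytes into a Perl character string.
my $text = decode('cp1251', $cp1251_bytes);

# encode() turns the character string into bytes in the target encoding.
my $utf8_bytes  = encode('UTF-8',    $text);
my $utf16_bytes = encode('UTF-16LE', $text);

# The PerlIO route: an :encoding layer converts on output. An in-memory
# filehandle is used here only so the result can be inspected.
my $buffer = '';
open my $fh, '>:encoding(cp1251)', \$buffer or die "open failed: $!";
print {$fh} $text;
close $fh;

printf "%d characters, %d UTF-8 bytes, %d UTF-16LE bytes\n",
    length $text, length $utf8_bytes, length $utf16_bytes;
```

The point to notice is that everything goes through a decoded character string in the middle; there is no direct bytes-to-bytes step.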

Actually, there's another mystery (for me, at least, since I don't have a Russian keyboard): what character codes are being received by the cgi script from the web client when the form is submitted? When someone loads the form into their browser, goes to a type-in box, and hits the upper-left-most letter (next to <tab>, below "1" and "2", the key that is "Q" on standard English keyboards), what code point (what actual value) is emitted to the form, and which character encoding is it based on? (And likewise for the other letters.)
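One way to find out is to dump the raw bytes the script receives and check whether they are well-formed UTF-8. The sketch below assumes the standard Russian layout, where that key produces U+0439; the helper names and the two sample submissions are made up for illustration. In a real cgi script the raw value would come from the form parameter (e.g. CGI.pm's param()) before any decoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Show each byte of a raw (undecoded) string as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# True if the bytes are well-formed UTF-8. This works on a copy of the
# argument, because decode() with a CHECK value may modify its input.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical submissions of the same letter (U+0439) in two encodings.
my %sample = (
    'as UTF-8'  => "\xD0\xB9",
    'as CP1251' => "\xE9",
);

for my $label (sort keys %sample) {
    printf "%-10s bytes: %-6s valid UTF-8: %s\n",
        $label,
        hex_dump($sample{$label}),
        is_valid_utf8($sample{$label}) ? 'yes' : 'no';
}
```

A "yes" is only a hint, not proof: plain ASCII input is valid UTF-8 too, so test with an actual Cyrillic letter.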

If you can answer that last question (and the answer is not utf8) then you can add some simple steps to the cgi code that will convert from that encoding into utf8 (for use internally in the cgi script), into numeric html entities (for storing to the registry and transmitting back to the client browser), and even into the WinNT native encoding (whichever one that might be).

The perldoc man page for the Encode module can help a lot with getting characters from CP1251 (or UTF-16) into utf8 and back, and once you have a string stored in a scalar variable as utf8 text, you can use a regex like the following to convert that into numeric html entities:

  s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg

This takes each non-ASCII utf8 character in $_, and formats it as a numeric html entity. (See the "perlre" man page in 5.8.0, under "POSIX character class syntax", for more info on the ":ascii:" expression.)
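Here is a small sketch of that substitution wrapped in a function, showing why the string has to be decoded first: run on raw UTF-8 bytes, the same regex produces one entity per byte rather than one per character. The function names are hypothetical, and from_entities() is an extra not described above, included only as one possible way to get characters back out of stored entities.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace every non-ASCII character with a numeric html entity.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg;
    return $text;
}

# Reverse step (an assumption, not from the post): turn decimal
# entities back into characters.
sub from_entities {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr $1/eg;
    return $text;
}

# Sample data: the UTF-8 bytes of a six-letter Russian word.
my $utf8_bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
my $decoded    = decode('UTF-8', $utf8_bytes);

my $good = to_entities($decoded);     # one entity per character
my $bad  = to_entities($utf8_bytes);  # one entity per byte: not what you want

print "decoded first: $good\n";
print "raw bytes:     $bad\n";
```

Note that from_entities() also converts entities for ASCII characters (such as &#38;), which may or may not be what you want.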


Replies are listed 'Best First'.
Re^2: HTML Forms, utf-8, windows and perl by Avox (Sexton) on Dec 10, 2004 at 21:29 UTC
Thanks for your help! You pointed me in the right direction! I had been messing around with Encode while investigating all this, but didn't try CP1251! With 1251 it all worked! Excellent you might think, right? Well, in my original question I tried to simplify my situation by limiting myself to Russian. I didn't mention the other 15 languages I had to support (on their native NT installations). However, knowing I could do this with 1251, I looked in my C++ application's code pages for the encoding each language was using. I then listed all the encodings that Encode supports. Lo and behold, all of them existed in there. So I created a function that polls the OS to see what language we are using, then encoded from utf-8 into the appropriate encoding for that language! Perl is the glue that holds my life together... although sometimes I wish it wasn't quite so sticky! (Actually, it was Perl that made my solution possible; I wish internationalization wasn't so sticky! :D) Thanks again!
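A rough sketch of what such a function might look like follows. This is not the actual code: the language names, the lookup table, and to_native() are hypothetical, and the language is passed in as an argument because the OS query isn't shown here. The code pages listed are ones Encode supports out of the box.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical lookup table; a real one would be built from the code
# pages the C++ application actually uses.
my %encoding_for = (
    russian => 'cp1251',
    polish  => 'cp1250',
    german  => 'cp1252',
    greek   => 'cp1253',
    turkish => 'cp1254',
);

# Convert UTF-8 bytes into the native encoding for the given language.
sub to_native {
    my ($language, $utf8_bytes) = @_;
    my $name = $encoding_for{$language}
        or die "No encoding known for '$language'\n";
    find_encoding($name)
        or die "Encode does not support '$name'\n";
    return encode($name, decode('UTF-8', $utf8_bytes));
}

# Sample data: the UTF-8 bytes of a six-letter Russian word.
my $native = to_native('russian',
    "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82");

print join(' ', map { sprintf '%02X', ord } split //, $native), "\n";
```

Characters that the target code page cannot represent are replaced with a substitution character by default; pass a CHECK argument to encode() if you would rather have it die.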
