in reply to HTML Forms, utf-8, windows and perl
In any case, the problem is that the command-line app is using one encoding, and the cgi script (and browsers) are using a different one. This in itself is not a show-stopper: Perl 5.8.0's Encode module (and/or the PerlIO layer) can easily handle the conversion between different encodings. You just need to know which other encoding you're dealing with (besides utf8).
Actually, there's another mystery (for me, at least, since I don't have a Russian keyboard): what character codes are being received by the cgi script from the web client when the form is submitted? When someone loads the form into their browser, goes to a type-in box, and hits the upper-left-most letter (next to <tab>, below "1" and "2", the key that is "Q" on standard English keyboards), what code point (what actual value) is emitted to the form, and which character encoding is it based on? (And likewise for the other letters.)
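One way to find out is to dump the raw bytes the script receives as hex and compare them against the code charts. The snippet below is only a sketch: hex_dump is a made-up helper name, and the sample string is hard-coded here, where a real script would use the value from the form (e.g. what CGI.pm's param() returns). On the standard Russian layout the key in question is "й", which would arrive as the single byte E9 in CP1251 or as the two bytes D0 B9 in utf8.

```perl
use strict;
use warnings;

# Hypothetical helper: show each byte of a raw string as two hex digits.
sub hex_dump {
    my ($raw) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $raw;
}

# Assumed sample input: the bytes for Cyrillic "й" (U+0439) as they
# would look if the browser submitted the form in utf8.
my $sample = "\xD0\xB9";
print hex_dump($sample), "\n";
```

If the dump shows one byte per letter, the form data is in a single-byte encoding such as CP1251 or KOI8-R; two bytes starting with D0 or D1 per Cyrillic letter point to utf8.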
If you can answer that last question (and the answer is not utf8) then you can add some simple steps to the cgi code that will convert from that encoding into utf8 (for use internally in the cgi script), into numeric html entities (for storing to the registry and transmitting back to the client browser), and even into the WinNT native encoding (whichever one that might be).
The perldoc man page for the Encode module can help a lot with getting characters from CP1251 (or UTF16) into utf8 and back, and once you have a string stored in a scalar variable as utf8 text, you can use a regex like the following to convert that into numeric html entities:
    s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg

This takes each non-ASCII utf8 character in $_, and formats it as a numeric html entity. (See the "perlre" man page in 5.8.0, under "POSIX character class syntax", for more info on the ":ascii:" expression.)
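Putting the decoding step and the regex together might look roughly like this. It is a sketch under assumptions: bytes_to_entities is a made-up name, and CP1251 is only a guess at what the browser sends, so substitute whatever encoding you actually find.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes from a known source encoding
# into a Perl character string, then replace every non-ASCII character
# with a numeric html entity.
sub bytes_to_entities {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    $text =~ s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg;
    return $text;
}

# Assumed input: "й" (U+0439) as the single CP1251 byte 0xE9.
print bytes_to_entities("Q=\xE9", 'cp1251'), "\n";   # prints Q=&#1081;
```

The result is plain ASCII, so it can be stored or sent back to the browser without worrying about which encoding the other side expects.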
Re^2: HTML Forms, utf-8, windows and perl
by Avox (Sexton) on Dec 10, 2004 at 21:29 UTC