The missing piece of the puzzle is which character encoding your WinNT-native command-line app uses for Russian. It could be UTF-16LE (two bytes per Cyrillic character), or it could be CP1251 (one byte per character). It wouldn't make any sense to consider 8859-1 (Latin1), which is Western European, not Cyrillic; and I doubt that 8859-5 (the ISO single-byte table for Cyrillic) would come into play, because its layout bears little or no relation to Microsoft's native CP1251.

In any case, the problem is that the command-line app is using one encoding, and the cgi script (and browsers) are using a different encoding. This in itself is not a show-stopper: Perl 5.8.0's Encode module (and/or the PerlIO layer) can easily handle the conversion between different encodings. You just need to know which other encoding you're dealing with (besides utf8).
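For what it's worth, here is a minimal sketch of that kind of conversion with Encode. The sample bytes are my own made-up input (the word "Привет" as CP1251), and CP1251 is only an assumption about what the app actually uses:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample input: "Привет" as CP1251 bytes (one byte per letter).
my $cp1251_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

# Step 1: turn the raw bytes into Perl characters.
my $chars = decode('cp1251', $cp1251_bytes);

# Step 2: turn the characters back into bytes in whatever encoding is needed.
my $utf8_bytes  = encode('UTF-8',    $chars);
my $utf16_bytes = encode('UTF-16LE', $chars);

printf "%d characters, %d UTF-8 bytes, %d UTF-16LE bytes\n",
    length $chars, length $utf8_bytes, length $utf16_bytes;
```

The point of going through `decode` first is that every conversion then passes through one common internal form, so adding another target encoding is just one more `encode` call.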

Actually, there's another mystery (for me, at least, since I don't have a Russian keyboard): what character codes are being received by the cgi script from the web client when the form is submitted? When someone loads the form into their browser, goes to a type-in box, and hits the upper-left-most letter (next to <tab>, below "1" and "2", the key that is "Q" on standard English keyboards), what code point (what actual value) is emitted to the form, and which character encoding is it based on? (And likewise for the other letters.)
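One way to find out is to hex-dump whatever bytes arrive and compare them with what each candidate encoding would produce. The sketch below assumes the standard Russian layout, where that key gives "й" (U+0439); `hex_bytes` is a helper name I made up, and in the real cgi script you would pass it the raw form value instead of the encoded test letter:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The letter on the "Q" key of the standard Russian layout (an assumption).
my $letter = "\x{0439}";

# What that one letter looks like in each candidate encoding.
my %as = map { $_ => hex_bytes(encode($_, $letter)) } qw(UTF-8 cp1251 UTF-16LE);

print "$_: $as{$_}\n" for sort keys %as;
```

If the dump of the submitted value matches one of these byte patterns, that tells you which encoding the browser used.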

If you can answer that last question (and the answer is not utf8) then you can add some simple steps to the cgi code that will convert from that encoding into utf8 (for use internally in the cgi script), into numeric html entities (for storing to the registry and transmitting back to the client browser), and even into the WinNT native encoding (whichever one that might be).
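Roughly, those steps might look like the sketch below. `convert_form_value` and its arguments are hypothetical names of mine, and the encodings in the sample call (CP1251 in, UTF-16LE as the "native" one) are guesses that you would replace with whatever you actually find:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw form bytes plus the two encoding names,
# returns the three representations discussed above.
sub convert_form_value {
    my ($raw_bytes, $input_encoding, $native_encoding) = @_;

    # Decode once into Perl characters for internal use.
    my $chars = decode($input_encoding, $raw_bytes);

    # Numeric html entities for the registry and for the browser.
    (my $entities = $chars) =~ s/([^[:ascii:]])/sprintf("&#%d;", ord $1)/eg;

    return {
        utf8     => encode('UTF-8', $chars),
        entities => $entities,
        native   => encode($native_encoding, $chars),
    };
}

# Sample call: one CP1251 byte (the letter U+0439), UTF-16LE assumed as native.
my $out = convert_form_value("\xE9", 'cp1251', 'UTF-16LE');
print "$out->{entities}\n";
```

Note that this does not escape "&" or "<" in the input; that would be a separate step if the value is echoed back into a page.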

The perldoc page for the Encode module can help a lot with getting characters from CP1251 (or UTF-16) into utf8 and back. Once you have the text decoded and stored in a scalar variable as utf8 characters, you can use a regex like the following to convert it into numeric html entities:

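s/([^[:ascii:]])/sprintf( "&#%d;", ord $1 )/eg

This takes each non-ASCII utf8 character in $_, and formats it as a numeric html entity. Note that it only gives the right code points if $_ holds decoded characters (e.g. the result of Encode's decode); on raw, undecoded utf8 bytes, ord would return the value of each individual byte instead. (See the "perlre" man page in 5.8.0, under "POSIX character class syntax", for more info on the ":ascii:" expression.)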
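To see why it matters whether the string has been decoded before the regex runs, here is a small sketch that applies the same substitution to raw utf8 bytes and to the decoded characters. `to_entities` is just a wrapper name I picked for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper around the substitution shown above.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf("&#%d;", ord $1)/eg;
    return $text;
}

my $utf8_bytes = "\xD0\xB9";    # the two utf8 bytes for U+0439

# Undecoded: each byte is treated as its own "character".
my $from_bytes = to_entities($utf8_bytes);

# Decoded first: one character, one entity.
my $from_chars = to_entities(decode('UTF-8', $utf8_bytes));

print "undecoded: $from_bytes\n";
print "decoded:   $from_chars\n";
```

The undecoded case yields two bogus entities (one per byte), while the decoded case yields the single entity for the Cyrillic letter.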

In reply to Re: HTML Forms, utf-8, windows and perl by graff
in thread HTML Forms, utf-8, windows and perl by Avox