mdxi has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a multilingual dictionary-type thing which has a web interface as its initial frontend. The problem I am running into is that data coming from HTML forms doesn't seem to be recognized as Unicode, and thus fails regexp tests that it should pass (like /\p{InKatakana}/).

I ran into this problem when doing the initial import from a flatfile but was able to fix that by doing 'use open ":utf8";'. I've tried setting forms to 'accept-charset="utf-8"', and using the various techniques described in perldoc perlunicode and Encode, and even setting STDIN's binmode to utf8 (which I didn't expect to work but what the hey).

I know the form data is *in* Unicode because it will match things in the database, just not in the right place.

So my question boils down to: how do I get perl to believe that data from HTML forms is, in fact, Unicode?
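
Something along these lines reproduces the problem (a sketch only; it assumes CGI.pm and uses a made-up parameter name):

use CGI;

my $q    = CGI->new;
my $term = $q->param('kana');    # UTF-8 bytes straight from the form

# never matches, because perl treats $term as raw bytes
# rather than as a string of characters:
print "looks like katakana\n" if $term =~ /\p{InKatakana}/;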

Replies are listed 'Best First'.
Re: Unicode and Forms
by Aristotle (Chancellor) on Dec 14, 2003 at 19:33 UTC
    If the data is in fact already UTF-8 encoded, then the _utf8_on() function in the Encode module would be what you're looking for. It sets the internal UTF-8 flag on a string, but does not otherwise mangle it.
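
    A minimal sketch, assuming the data comes in via CGI.pm and really is well-formed UTF-8 (the parameter name is hypothetical):

    use CGI;
    use Encode ();

    my $q    = CGI->new;
    my $term = $q->param('kana');    # UTF-8 bytes, but the UTF-8 flag is off

    Encode::_utf8_on($term);         # set the flag; the bytes are left alone

    print "katakana\n" if $term =~ /\p{InKatakana}/;    # now matches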

    Makeshifts last the longest.

      That got it, thanks. I know I tried this last night (read: at 0500 this morning) but I got unexpected results and backed it out. I can only assume I typo'd or did something else dumb at the time. Thanks also to liz for the header idea -- I was pretty sure I already had it that way, but I checked to be sure.
Re: Unicode and Forms
by liz (Monsignor) on Dec 14, 2003 at 19:44 UTC
    If you're sure you're always getting UTF-8 from the browsers, then everything should be hunky-dory after using Encode::_utf8_on.

    However, I've found that some browsers can get confused about the character encoding the HTML is in, and consequently with which encoding data should be sent to the server.

    I've found that if the Content-Type: output header has the string "; charset=utf-8" postfixed, then things tend to work out. So, for text/html, the content-type would then read:

    Content-Type: text/html; charset=utf-8
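
    With CGI.pm, for instance, that can be done when printing the header (a sketch; adapt to however you emit your headers):

    use CGI;

    my $q = CGI->new;

    # emits "Content-Type: text/html; charset=utf-8"
    print $q->header( -type => 'text/html', -charset => 'utf-8' );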

    Hope this helps.

    Liz

      ... everything should be hunky-dory after using Encode::_utf8_on.

      ... unless of course something isn't hunky-dory to begin with. To quote the Encode man page:

      Messing with Perl's Internals

      The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change.
      ...
      _utf8_on(STRING)
      INTERNAL Turns on the UTF-8 flag in STRING. The data in STRING is not checked for being well-formed UTF-8. Do not use unless you know that the STRING is well-formed UTF-8. Returns the previous state of the UTF-8 flag (so please don't treat the return value as indicating success or failure), or "undef" if STRING is not a string.

      (emphasis in the original). If there's any chance that the incoming data might really not be proper utf8, then just treating it as if it were utf8 won't help.

      The safer, more stable (non-internal) method for "upgrading" a string to utf8 is covered in the Encode man page above the section quoted here, as is the part about "Handling Malformed Data", which might be relevant to the OP.
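
      Just to illustrate the point about validation (the malformed byte string below is made up for the example):

      use Encode ();

      my $bytes = "\xE3\x82";      # a truncated multi-byte sequence -- not well-formed UTF-8

      Encode::_utf8_on($bytes);    # no checking is done; the flag goes on regardless
      print Encode::is_utf8($bytes) ? "flagged as utf8\n" : "not flagged\n";
      # prints "flagged as utf8", even though the data is broken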

Re: Unicode and Forms
by JamesNC (Chaplain) on Dec 14, 2003 at 19:56 UTC
    What version of perl? What OS are you working on? 5.8 handles Unicode better IMO, and it really helps others to know this stuff. Don't be surprised by problems dealing with Unicode elsewhere either: Microsoft Word 2000 will not correctly encode certain Unicode character sets that it says it supports, and cannot even encode UTF-8 properly, while on the same Win2K box notepad.exe does this just fine! Microsoft told us to upgrade to Office XP. Certain modules also have to be told very early on that Unicode is being used. Unicode is not very well understood by most folks.
Re: Unicode and Forms
by graff (Chancellor) on Dec 15, 2003 at 03:16 UTC
    If I understand the situation correctly (not sure that I do), utf8 data is coming in on a CGI parameter string (or something like that), but at the point where the utf8 string is assigned to a scalar in the script, that scalar is not being flagged as holding utf8 data. In this case, the safer, more stable way (relative to what liz suggested above) to make perl use it as utf8 is to "decode" it:
    use Encode qw(decode);
    my $utf8_param = decode( 'utf8', $input_param );
    and maybe add the third arg to the decode call, so that it does something useful in case the input data turns out not to be valid utf8. See the section of the Encode man page headed "Handling Malformed Data": you can pass a "CHECK" parameter that says "die on error", and wrap the decode call in an eval block to see whether it worked, before moving on to doing regex matches involving unicode character classes.

    (If you don't check the result of the decode call, parts that it couldn't decode will show up as "\x{FFFD}", which would be "safe", but not very informative in terms of diagnostics.)
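
    Put together, that might look something like this (the parameter name is hypothetical):

    use CGI;
    use Encode qw(decode);

    my $q     = CGI->new;
    my $bytes = $q->param('kana');

    # FB_CROAK tells decode() to die on malformed input; eval catches that
    my $term = eval { decode( 'utf8', $bytes, Encode::FB_CROAK ) };
    if ($@) {
        die "form data is not valid utf8: $@";
    }

    # $term is now flagged as UTF-8, so unicode property matches behave
    print "katakana\n" if $term =~ /\p{InKatakana}/;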