vic has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash hard-coded in the program (in ASCII), and I read a string from an XML file (in UTF-8, extracted via libXML). The string and one of the hash keys are identical in content (the same English letters), but the hash returns nothing if I write the lookup this way:

$value = $hard_code_hash{$utf8_string}

What happened?

I guess the problem may be the mismatch between the UTF-8 string (the query) and the non-UTF-8 string (the key), but why does it happen, and is there any solution?

*The Perl version is 5.8.5 (Unix) and I can't upgrade it. Storing the Perl file in UTF-8 and calling "use utf8" does not sound good, because I do the programming on Windows and the UTF-8 signature (BOM) kills the Perl interpreter.

Re: hard-coded hash return nothing for utf-8 string
by kyle (Abbot) on Feb 21, 2008 at 18:43 UTC

    I tried to reproduce this on a later version of Perl (5.8.8) and couldn't. Here's the test I tried:

    I wonder if there's some difference between the version you have and the version I have, so I'd like to know whether that test does the same thing for you that it does for me. I get the same results regardless of whether I have use utf8 at the top.
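
A test along these lines could look like the sketch below. It is not kyle's original code: the hash contents are made up, and `utf8::upgrade` stands in for "a string that came back from an XML parser with Perl's internal UTF-8 flag set". It assumes Perl 5.8.1 or later, where the `utf8::` functions are available without loading the pragma.

```perl
use strict;
use warnings;

# Hard-coded hash with plain ASCII keys (hypothetical contents).
my %hard_code_hash = ( apple => 1, banana => 2 );

# Simulate a query string that carries Perl's internal UTF-8 flag,
# as a string returned by an XML parser typically would.
my $utf8_string = 'apple';
utf8::upgrade($utf8_string);

# Look the flagged string up in the hash of unflagged keys.
my $value = $hard_code_hash{$utf8_string};

print "utf8 flag on query: ", ( utf8::is_utf8($utf8_string) ? "yes" : "no" ), "\n";
print "lookup result: ", ( defined $value ? $value : "undef" ), "\n";
```

If the lookup prints a value here but the real program still fails, the flag alone is probably not the cause, and the two strings are worth comparing character by character.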

Re: hard-coded hash return nothing for utf-8 string
by Joost (Canon) on Feb 21, 2008 at 19:09 UTC
    *the perl version is 5.8.0 (unix) and I can't upgrade it; store the perl file in utf-8 and call "use utf8" is not sound good as I do the programming on window, the utf-8 signature kill the perl interpreter
    Regardless of whether this is really a Perl bug or something in your code, for serious Unicode work you should upgrade both perls to at least 5.8.8. Various UTF-8 bugs have been fixed and semantics have changed since 5.8.0. perl581delta says, for instance:
    For example, if you had "en_US.UTF-8" as your locale, your STDIN and STDOUT were automatically "UTF-8", in other words an implicit binmode(..., ":utf8") was made. This meant that trying to print, say, chr(0xff), ended up printing the bytes 0xc3 0xbf. Hardly what you had in mind unless you were aware of this feature of Perl 5.8.0. The problem is that the vast majority of people weren’t: for example in RedHat releases 8 and 9 the default locale setting is UTF-8, so all RedHat users got UTF-8 filehandles, whether they wanted it or not. The pain was intensified by the Unicode implementation of Perl 5.8.0 (still) having nasty bugs, especially related to the use of s/// and tr///. (Bugs that have been fixed in 5.8.1)

    Therefore a decision was made to backtrack the feature and change it from implicit silent default to explicit conscious option. The new Perl command line option "-C" and its counterpart environment variable PERL_UNICODE can now be used to control how Perl and Unicode interact at interfaces like I/O and for example the command line arguments. See "-C" in perlrun and "PERL_UNICODE" in perlrun for more information.
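
The effect described in the quoted passage can be reproduced explicitly on any later Perl by putting a ":utf8" layer on a filehandle by hand. The sketch below uses an in-memory filehandle (the `$buf` variable is only for illustration) so the resulting bytes can be inspected; it should show the two bytes 0xc3 0xbf mentioned in the quote.

```perl
use strict;
use warnings;

# Write chr(0xff) through a handle that has the :utf8 layer applied,
# capturing the output bytes in an in-memory buffer.
my $buf = '';
open( my $fh, '>:utf8', \$buf ) or die "cannot open in-memory handle: $!";
print {$fh} chr(0xff);
close($fh);

# Show the bytes that actually reached the "file".
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $buf;
print "$hex\n";
```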

    Yes, I noticed that you said you can't upgrade it. Ask again.

      Thank you for your detailed explanation.

      I have now double-checked the Perl version, and it is 5.8.5. I will try to reproduce the bug later.
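
When trying to reproduce this, it may help to rule out invisible differences between the key and the query first, such as a trailing newline, a byte order mark, or surrounding whitespace in the XML text node. The following is a minimal sketch; `dump_codepoints` is a hypothetical helper, and the trailing newline in `$query` is an assumed example rather than something known about the original data.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character as U+XXXX so that hidden
# characters (BOM, newline, non-breaking space) become visible.
sub dump_codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $key   = 'apple';
my $query = "apple\n";    # assumed example: text node with a trailing newline

print "key:   ", dump_codepoints($key),   "\n";
print "query: ", dump_codepoints($query), "\n";
print $key eq $query ? "strings are equal\n" : "strings differ\n";
```

If the two dumps are identical and the lookup still fails, that points back toward the encoding flag or the Perl version rather than the data.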

Re: hard-coded hash return nothing for utf-8 string
by Narveson (Chaplain) on Feb 21, 2008 at 18:27 UTC

    $utf-8_string is not a legal identifier. Underscore is allowed, but dash is interpreted as a minus sign.