jpvdv has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I've been pondering a strange problem for quite a while now. Here is the situation. I have an MS SQL Server table with an nvarchar column containing strings in all kinds of languages. As an example, there are two rows I query: one containing an i with a ROOF on top, and one with the same character followed by something unmistakably beyond 8-bit ASCII, an A with a reverse ROOF.

When I query these two rows and show them in an HTML page, the ROOFed I is shown correctly in the first row but not in the second. That is, if the encoding in the browser (both IE and Firefox) is set to UTF-8; if it is set to Western European, then it is the other way around: the ROOFed I shows, and the Unicode character is printed "wide" as two ASCII chars.

It seems that Perl reads the ROOFed I from the db (ODBC driver), or writes it out, differently depending on whether a two-byte character follows it. It's not the HTML, because the same query in a cmd box shows the byte-count difference as well. The first ROOFED-I is represented by C4 83, the second by EE.

It gets even weirder: both rows from the database return true on the regex m/ROOFED-I/, and ord() of the first character is 238 in both cases, the ROOFed I in ASCII form as it were.

In practice this means that strings containing both 8-bit diacritical characters and Unicode characters wider than 8 bits cannot be displayed in HTML. I'm sure I am doing something wrong... who can help? Grtz=JP

Re: Strange behaviour ODBC/Unicode in perl
by ikegami (Patriarch) on Feb 01, 2008 at 12:37 UTC

    Some quick background to help those answering your question:
    C4 83 is the UTF-8 encoding of ă (259)
    C3 AE is the UTF-8 encoding of î (238) (OP prob meant this)
    EE is the iso-latin-1 encoding of î (238)
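
A minimal sketch, assuming only the core Encode module, that reproduces the byte sequences listed above. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Return the bytes of a byte string as space-separated uppercase hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $i_circ  = chr(238);   # i with circumflex, U+00EE
my $a_breve = chr(259);   # a with breve, U+0103

my $i_utf8   = hex_bytes(encode('UTF-8',      $i_circ));    # C3 AE
my $i_latin1 = hex_bytes(encode('iso-8859-1', $i_circ));    # EE
my $a_utf8   = hex_bytes(encode('UTF-8',      $a_breve));   # C4 83

print "chr(238) as UTF-8:       $i_utf8\n";
print "chr(238) as iso-latin-1: $i_latin1\n";
print "chr(259) as UTF-8:       $a_utf8\n";
```

chr(259) has no iso-latin-1 encoding at all, which is why only the UTF-8 form is shown for it.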

      That's right. Thanx.
      The REVERSE ROOFED a (259) itself doesn't matter. What matters is that the ROOFED i is in UTF-8 form ONLY when the string also contains a character that cannot be expressed in iso-latin-1 (whatever that character is). Otherwise it 'falls back' to iso-latin-1 (238). Would there be a way to tell Perl (perhaps when getting the data from DBD-ODBC 1.14 with Unicode support) to keep (force) the data in UTF-8 form?

        No one else posted anything, so I'll give it a quick go...

        It seems you're assuming Perl's internal format is UTF-8.

        To be sure, it would help if I knew exactly what you were getting from the database. The "PV = " lines printed by the following would provide that info.

        use Devel::Peek qw( Dump );
        Dump($db_field);   # Output sent to STDERR
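
For anyone who wants to see the two storage forms without reading Dump output, here is a rough sketch. It assumes the driver hands back character strings (not raw bytes), and the helper internal_bytes is hypothetical. It relies only on the core utf8::is_utf8 and utf8::encode functions.

```perl
use strict;
use warnings;

my $row1 = chr(238);              # only chr(238): one byte per character suffices
my $row2 = chr(238) . chr(259);   # chr(259) forces the UTF8-flagged storage

# Hex dump of the string's buffer, working on a copy so the original is untouched.
sub internal_bytes {
    my ($str) = @_;
    my $copy = $str;
    utf8::encode($copy) if utf8::is_utf8($copy);   # expose the UTF-8 buffer as bytes
    return join ' ', map { sprintf '%02X', ord($_) } split //, $copy;
}

for my $row ($row1, $row2) {
    printf "UTF8 flag=%d  buffer=%-12s ord(first)=%d\n",
        utf8::is_utf8($row) ? 1 : 0,
        internal_bytes($row),
        ord(substr($row, 0, 1));
}
# row1: flag 0, buffer EE,          ord 238
# row2: flag 1, buffer C3 AE C4 83, ord 238
```

Both rows start with the same character as far as ord(), eq, and regexes are concerned. Only the buffer differs, and that buffer is what leaks out when the string is printed without an encoding step.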

        You should get one of the following four output combinations for chr(259) and chr(238).

        259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
        238: PV = 0x18e914c "\356"\0

        259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
        238: PV = 0x18ecffc "\303\256"\0 [UTF8 "\x{ee}"]

        259: PV = 0x18d73fc "?"\0
        238: PV = 0x18d7444 "\356"\0

        259: PV = 0x18d736c "\304\203"\0
        238: PV = 0x18ecffc "\303\256"\0

        259: PV = 0x18d74dc "\304\203"\0    <- Lack of [UTF8 ...]
        238: PV = 0x18d74dc "\356"\0
        • 259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
          238: PV = 0x18e914c "\356"\0

          or

          259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
          238: PV = 0x18ecffc "\303\256"\0 [UTF8 "\x{ee}"]

          The data is in Perl's internal text format (called "UTF8", no dash). It can either be in iso-latin-1 or UTF-8. iso-latin-1 is preferred since it's faster, but it can only be used if every character in the string can be represented by iso-latin-1.

          Solution: You need to convert from Perl's internal format to a string of bytes.

          use Encode qw( encode );
          print(encode('UTF-8', $text));

          or

          binmode(STDOUT, ':encoding(UTF-8)');
          print($text);
        • 259: PV = 0x18d73fc "?"\0
          238: PV = 0x18d7444 "\356"\0

          The data is in iso-latin-1. This can't be the case, since "ă" is working.

        • 259: PV = 0x18d736c "\304\203"\0
          238: PV = 0x18ecffc "\303\256"\0

          The data is in UTF-8 already. Just print it out without re-encoding it.

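        • 259: PV = 0x18d74dc "\304\203"\0    <- Lack of [UTF8 ...]
          238: PV = 0x18d74dc "\356"\0

          Here the 259 arrives as UTF-8 bytes while the 238 arrives as an iso-latin-1 byte, so the two rows use different encodings. That means the data in your database is bad, or your database is returning bad data.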

          The problem should be fixed at the source, but I think it's possible to fix it at this point too.
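
The cases above could be sketched roughly as follows. This is only an illustration under stated assumptions: the variable names and the helpers hex_bytes and decode_mixed are made up here, and the fallback used for the mixed case is a guess at one possible repair, not something the driver or Encode prescribes. A genuine iso-latin-1 string whose bytes happen to form valid UTF-8 would be misread by it.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Return the bytes of a byte string as space-separated uppercase hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# First case: the driver returned character strings. Encode once on output,
# and chr(238) comes out the same way in both rows.
my $row1 = chr(238);
my $row2 = chr(238) . chr(259);
my $out1 = encode('UTF-8', $row1);   # C3 AE
my $out2 = encode('UTF-8', $row2);   # C3 AE C4 83

# "Already UTF-8" case: the bytes are final. Encoding them again treats each
# byte as a character and double-encodes it.
my $already = "\303\256";
my $double  = encode('UTF-8', $already);   # C3 83 C2 AE, not what you want

# Mixed case (hypothetical repair): try strict UTF-8 first and fall back to
# iso-latin-1 when the bytes are not valid UTF-8.
sub decode_mixed {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('iso-8859-1', $bytes);
}

my $fixed_238 = decode_mixed("\356");       # invalid UTF-8, read as iso-latin-1
my $fixed_259 = decode_mixed("\304\203");   # valid UTF-8

print hex_bytes($out1), "\n";
print hex_bytes($out2), "\n";
print hex_bytes($double), "\n";
print ord($fixed_238), " ", ord($fixed_259), "\n";
```

Once decode_mixed has turned both rows into character strings, they can be sent through the same encode('UTF-8', ...) step as in the first case.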

        Update: Added second pair of output combinations, and fixed s/utf8/utf-8/