
No one else posted anything, so I'll give it a quick go...

It seems you're assuming Perl's internal format is UTF-8.

To be sure, it would help if I knew exactly what you were getting from the database. The "PV = " lines printed by the following would provide that info.

use Devel::Peek qw( Dump );
Dump($db_field);   # Output sent to STDERR

You should get one of the following output combinations for chr(259) and chr(238).

259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
238: PV = 0x18e914c "\356"\0

259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]
238: PV = 0x18ecffc "\303\256"\0 [UTF8 "\x{ee}"]

259: PV = 0x18d73fc "?"\0
238: PV = 0x18d7444 "\356"\0

259: PV = 0x18d736c "\304\203"\0
238: PV = 0x18ecffc "\303\256"\0

259: PV = 0x18d74dc "\304\203"\0   <- Lack of [UTF8 ...]
238: PV = 0x18d74dc "\356"\0
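The presence or absence of the [UTF8 ...] marker reflects whether Perl holds the value as raw bytes or as decoded characters. The following is only a minimal sketch of that distinction, not a reproduction of what the ODBC driver does: it assumes the driver hands back UTF-8 encoded bytes, and the variable names $bytes and $chars are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Assumption: the driver returns UTF-8 encoded bytes (no UTF8 flag).
my $bytes = encode('UTF-8', chr(259) . chr(238));

# Decoding turns those bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d, UTF8 flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d, UTF8 flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

The byte string has length 4 with the flag off, while the decoded string has length 2 with the flag on. Dump() on each would show the same "\304\203\303\256" buffer, with and without the [UTF8 ...] annotation.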

Update: Added second pair of output combinations, and fixed s/utf8/utf-8/

Re^4: Strange behaviour ODBC/Unicode in perl
by jpvdv (Initiate) on Feb 04, 2008 at 14:27 UTC
    Thanks so much for looking into my problem!

    What I get when I dump the database data is the third possibility (in both cases: isolated "î" and "îă"):

    259: PV = 0x18d736c "\304\203"\0
    238: PV = 0x18ecffc "\303\256"\0
    That means it is UTF-8 all the time, so I've learned that the data from the database is correct. But the weird thing is: when I print the î to STDOUT (to get it to the browser), it is turned into \xEE. In a CMD box it shows as a Euro sign (could be handy...), and in the browser it shows only as a square; that is, Palatino Linotype, the TTF that supports a big character repertoire, doesn't like it. If I look at the source of the HTML page from the browser, it says î. But when it is accompanied by the ă, then it shows...

    If I made a workaround to change \xEE back to \303\256 (and all the rest), I could not, at that point, see the difference between the two î's. It's only in outputting, so it seems, that the isolated î is offered to the browser/STDOUT in ISO Latin-1 form. Can you shed more light on this?

      Printing shouldn't cause any conversion.

      $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(259)); Dump($x); print $x' | od -b
      SV = PV(0x819f9e0) at 0x814cc6c
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x81698e0 "\304\203"\0 ---.
        CUR = 2                         \__ same
        LEN = 3                         /
      0000000 304 203 ------------------'
      0000002

      $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(238)); Dump($x); print $x' | od -b
      SV = PV(0x819f9e0) at 0x814cc6c
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x81698e0 "\303\256"\0 ---.
        CUR = 2                         \__ same
        LEN = 3                         /
      0000000 303 256 ------------------'
      0000002

      How do you know it's outputting xEE?

      Are you using :encoding() on STDOUT? You shouldn't with this data.

      Are you using CGI's HTML generation methods (print h1(text))? They do some encoding too.
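On the :encoding() question: if the data really is UTF-8 encoded bytes, an encoding layer would encode it a second time. This is a rough sketch of that effect using an in-memory filehandle in place of STDOUT; the names $bytes, $out and $hex are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( encode );

# î already encoded as UTF-8 bytes: "\xC3\xAE"
my $bytes = encode('UTF-8', chr(238));

# Write through an :encoding(UTF-8) layer into a scalar instead of STDOUT.
my $out = '';
open(my $fh, '>:encoding(UTF-8)', \$out) or die "open failed: $!";
print {$fh} $bytes;
close($fh) or die "close failed: $!";

# Show what actually got written, byte by byte.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;
print "$hex\n";
```

Each of the two input bytes is treated as a Latin-1 character and encoded again, so four bytes (C3 83 C2 AE) come out instead of two.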

        Thanks so very much for your time and patience.

        I get GOOD output now; what did the trick is setting
        binmode(STDOUT, ':encoding(utf8)');
        That stopped the strange behaviour. I think the "îă" being displayed was the exception rather than the right thing. It was the Palatino font's ability to interpret the wide character that was in the HTML text, not a restriction on printing î. On closer inspection, I had gotten a "Wide character" warning as well...
        When I look at the source of the page as the browser received it I see
        î&#259; î
        rather than
        îă î
        The "îă" in the latter case looked right, but really wasn't... Thanks again for helping me sort this out.
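For anyone landing here later, the symptoms in this thread can be reproduced with a small sketch. It assumes the strings being printed are character strings (not already-encoded bytes), writes to in-memory filehandles rather than STDOUT, and uses a hypothetical helper bytes_written(). The layer is spelled :encoding(UTF-8) here; the fix above used :encoding(utf8).

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through a handle opened with $layer
# and return the bytes that were written, as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "open failed: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $string;
    }
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

my $i_alone = chr(238);               # î
my $i_breve = chr(238) . chr(259);    # îă

print "no layer, alone:   ", bytes_written('', $i_alone), "\n";
print "no layer, with ă:  ", bytes_written('', $i_breve), "\n";
print "utf-8 layer, alone:  ", bytes_written(':encoding(UTF-8)', $i_alone), "\n";
print "utf-8 layer, with ă: ", bytes_written(':encoding(UTF-8)', $i_breve), "\n";
```

Without a layer, the lone î goes out as the single byte EE, while the string containing ă cannot be written as single bytes, so Perl warns about a wide character and emits its UTF-8 form. With the encoding layer, both cases come out as UTF-8 consistently.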