in reply to Re^3: Strange behaviour ODBC/Unicode in perl
in thread Strange behaviour ODBC/Unicode in perl

Thanks so much for looking into my problem!

What I get when I dump the database data is the third possibility (in both cases: isolated "î" and "îă"):

259: PV = 0x18d736c "\304\203"\0
238: PV = 0x18ecffc "\303\256"\0
That means it is UTF-8 all the time, so I've learned that the data from the database is correct. But the weird thing is: when I print the î to STDOUT (so that it goes to the browser), it is turned into \xEE. In a CMD box it shows as a Euro sign (could be handy...), and in the browser it shows only as a square; that is, Palatino Linotype, the TTF that supports a big character repertoire, doesn't like it. If I look at the source of the HTML page from the browser, it says î. But when it is accompanied by the ă, then it shows...

If I were to make a workaround that changes \xEE back to \303\256 (and all the rest), I could not, at that point, see the difference between the two î's. It's only in outputting, so it seems, that the isolated î is offered to the browser/STDOUT in ISO Latin-1 form. Can you shed more light on this?

Re^5: Strange behaviour ODBC/Unicode in perl
by ikegami (Patriarch) on Feb 05, 2008 at 00:44 UTC

    Printing shouldn't cause any conversion

    $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(259)); Dump($x); print $x' | od -b
    SV = PV(0x819f9e0) at 0x814cc6c
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x81698e0 "\304\203"\0 ---.
      CUR = 2                         \__ same
      LEN = 3                         /
    0000000 304 203 ------------------'
    0000002

    $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(238)); Dump($x); print $x' | od -b
    SV = PV(0x819f9e0) at 0x814cc6c
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x81698e0 "\303\256"\0 ---.
      CUR = 2                         \__ same
      LEN = 3                         /
    0000000 303 256 ------------------'
    0000002

    How do you know it's outputting xEE?

    Are you using :encoding() on STDOUT? You shouldn't with this data.

    Are you using CGI's HTML generation methods (print h1(text))? They do some encoding too.
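
    To illustrate the :encoding() question: if the strings are already UTF-8 bytes, as the Dump output above suggests, an encoding layer on STDOUT would encode them a second time. Below is a minimal sketch of that effect, using Encode::encode as a stand-in for the layer. The helper hex_bytes is hypothetical and only there to make the bytes visible.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as hex pairs, e.g. "c3 ae".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Bytes as shown by Dump in the thread: the UTF-8 encoding of î (U+00EE).
my $from_db = "\303\256";

# Encoding bytes that are already UTF-8 treats each byte as a Latin-1
# character and encodes it again (assumption: this mirrors what an
# :encoding layer would do to the same string on output).
my $twice = encode('UTF-8', $from_db);

print hex_bytes($from_db), "\n";   # c3 ae
print hex_bytes($twice),   "\n";   # c3 83 c2 ae
```

    Four bytes instead of two for a single î is the usual signature of double encoding.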

      Thanks so very much for your time and patience.

      I get GOOD output now; what did the trick is setting
      binmode(STDOUT, ':encoding(utf8)');
      That stopped the strange behaviour. I think the "îă" being displayed was the exception rather than the right thing: it was the ability of the Palatino font to interpret the wide character that was in the HTML text, and not a restriction on printing î. On close watch I had gotten a Wide Character-warning as well....
      When I look at the source of the page as the browser received it I see
      îă î
      rather than
      îă î
      The "îă" in the latter case `looked` right, but really wasn't.... Thanks again for helping me sort this out.

        On close watch I had gotten a Wide Character-warning as well....

        That doesn't jibe with what you said earlier. To get a wide character warning, you need to have a wide character, yet you said the output you got from Dump didn't have [UTF8 "..."], so no wide characters.
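
        For reference, here is a sketch of when that warning does appear. It assumes, purely for illustration, that the strings had been decoded into character strings somewhere before the print; nothing in the thread confirms that. print_raw and hex_bytes are hypothetical helpers, and an in-memory handle stands in for STDOUT without any encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $text to an in-memory handle that has no
# encoding layer; return the bytes written followed by any warnings.
sub print_raw {
    my ($text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return ($buffer, @warnings);
}

# Hypothetical helper: render a byte string as hex pairs.
sub hex_bytes { return join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

my $i_only = decode('UTF-8', "\303\256");           # "î" as a character string
my $i_a    = decode('UTF-8', "\303\256\304\203");   # "îă" as a character string

my ($out1, @w1) = print_raw($i_only);
my ($out2, @w2) = print_raw($i_a);

# A lone character below 256 is written as a single Latin-1 byte, silently.
print hex_bytes($out1), ' warnings=', scalar(@w1), "\n";   # ee warnings=0
# Once a character above 255 is present, the whole string is written in its
# UTF-8 form and "Wide character in print" is raised.
print hex_bytes($out2), ' warnings=', scalar(@w2), "\n";   # c3 ae c4 83 warnings=1
```

        Under that assumption, an isolated î would come out as \xEE while "îă" would come out as UTF-8 with a warning, which resembles the symptoms described, though it does not match the Dump output quoted earlier.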

        binmode(STDOUT, ':encoding(utf8)');

        binmode(STDOUT, ':encoding(utf8)');
        is a speed hack for
        binmode(STDOUT, ':encoding(utf-8)');
        The former skips some checks, but doing so opens up a security vulnerability. Don't use the former on untrusted text. In fact, don't use the former.
        (I mistakenly used utf8 in my earlier post, sorry)
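
        A small sketch of the difference between the two names, as far as the Encode module itself is concerned. The helper decodes_ok is hypothetical, and the surrogate sequence is just one convenient example of input that strict UTF-8 must reject.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to two different Encode encodings.
print find_encoding('utf8')->name,  "\n";   # utf8
print find_encoding('UTF-8')->name, "\n";   # utf-8-strict

# Hypothetical helper: 1 if $bytes decodes without error, 0 otherwise.
sub decodes_ok {
    my ($encoding, $bytes) = @_;
    my $copy = $bytes;   # decode may modify its input when a CHECK is given
    return eval { decode($encoding, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# U+D800, a UTF-16 surrogate, written out as if it were UTF-8.
# This is not well-formed UTF-8.
my $surrogate = "\xED\xA0\x80";

print 'utf8 accepts it:  ', decodes_ok('utf8',  $surrogate), "\n";
print 'UTF-8 accepts it: ', decodes_ok('UTF-8', $surrogate), "\n";
```

        The strict encoding refuses the sequence. The lax one is documented as the liberal variant, so I would expect it to let the sequence through, but that is worth checking on your own Encode version.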

        When I look at the source of the page as the browser received it I see

        I wouldn't use view source for this at all. Look at the actual bytes of the source. You should see two bytes for each of those chars if the page uses the UTF-8 charset.
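
        As a reference point for that check, this sketch prints the bytes a UTF-8 page should contain for "îă î". The helper hex_bytes is hypothetical. On real output, piping through od -b as in the earlier one-liners serves the same purpose.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

# "îă î" as a character string: U+00EE, U+0103, space, U+00EE.
my $text = "\x{EE}\x{103} \x{EE}";

# Expected bytes on the wire if the page really is UTF-8:
# two bytes per accented character, one for the space.
my $expected = hex_bytes(encode('UTF-8', $text));
print $expected, "\n";   # c3 ae c4 83 20 c3 ae
```

        A lone ee in place of c3 ae would mean the î went out as Latin-1, and c3 83 c2 ae would mean it was encoded twice.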