Re^2: Strange behaviour ODBC/Unicode in perl

Re^3: Strange behaviour ODBC/Unicode in perl by ikegami (Patriarch) on Feb 02, 2008 at 02:09 UTC
No one else posted anything, so I'll give it a quick go...

It seems you're assuming Perl's internal format is UTF-8. To be sure, it would help if I knew exactly what you were getting from the database. The "PV = " lines printed by the following would provide that info.

`use Devel::Peek qw( Dump );`
`Dump($db_field);  # Output sent to STDERR`

You should get one of the following four output combinations for chr(259) ("ă") and chr(238) ("î").

Case 1:

`259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]`
`238: PV = 0x18e914c "\356"\0`

or

`259: PV = 0x18e914c "\304\203"\0 [UTF8 "\x{103}"]`
`238: PV = 0x18ecffc "\303\256"\0 [UTF8 "\x{ee}"]`

The data is in Perl's internal text format (called "UTF8", no dash). Internally the string is stored either as iso-latin-1 or as UTF-8. iso-latin-1 is preferred since it's faster, but it can only be used if every character in the string can be represented in iso-latin-1.

Solution: You need to convert from Perl's internal format to a string of bytes.

`use Encode qw( encode );`
`print(encode('UTF-8', $text));`

or

`binmode(STDOUT, ':encoding(UTF-8)');`
`print($text);`

Case 2:

`259: PV = 0x18d73fc "?"\0`
`238: PV = 0x18d7444 "\356"\0`

The data is in iso-latin-1. This can't be the case, since "ă" is working.

Case 3:

`259: PV = 0x18d736c "\304\203"\0`
`238: PV = 0x18ecffc "\303\256"\0`

The data is in UTF-8 already. Just print it out without re-encoding it.

Case 4:

`259: PV = 0x18d74dc "\304\203"\0   <- Lack of [UTF8 ...]`
`238: PV = 0x18d74dc "\356"\0`

That means the data in your database is bad, or your database is returning bad data.
The problem should be fixed at the source, but I think it's possible to fix it at this point too.

Update: Added second pair of output combinations, and fixed s/utf8/utf-8/
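The first case above can be checked with a short script: the two characters are stored differently inside Perl, yet `encode('UTF-8', ...)` produces the right bytes for both. This is only a minimal sketch. The helper `octal_bytes` and the `%result` hash are made up for illustration, and `utf8::is_utf8` is used purely to peek at the internal flag that Devel::Peek shows as `[UTF8 ...]`.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: show a byte string in the same octal notation
# that Devel::Peek uses in its "PV = " lines.
sub octal_bytes {
    my ($bytes) = @_;
    return join '', map { sprintf('\\%03o', ord($_)) } split(//, $bytes);
}

my $a_breve = chr(259);   # cannot be stored as iso-latin-1, so the UTF8 flag is on
my $i_circ  = chr(238);   # fits in iso-latin-1, so the UTF8 flag is off

my %result = (
    a_breve_flag  => utf8::is_utf8($a_breve) ? 1 : 0,
    i_circ_flag   => utf8::is_utf8($i_circ)  ? 1 : 0,
    # encode() returns UTF-8 bytes regardless of the internal storage.
    a_breve_bytes => octal_bytes(encode('UTF-8', $a_breve)),
    i_circ_bytes  => octal_bytes(encode('UTF-8', $i_circ)),
);

print "$_ = $result{$_}\n" for sort keys %result;
```

The flag differs between the two strings, but both are text, which is why the flag alone is not a reliable way to decide whether a value still needs encoding.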
Re^4: Strange behaviour ODBC/Unicode in perl by jpvdv (Initiate) on Feb 04, 2008 at 14:27 UTC
Thanks so much for looking into my problem!

What I get when I dump the database data is the third possibility (in both cases: isolated "î" and "îă"):

`259: PV = 0x18d736c "\304\203"\0`
`238: PV = 0x18ecffc "\303\256"\0`

That means it is UTF-8 all the time, so I've learned that the data from the database is correct.

But the weird thing is: when I print the î to STDOUT (to get it to go to the browser), it is turned into \xEE. In a CMD box it shows as a Euro sign (could be handy...), and in the browser it shows only as a square; that is, Palatino Linotype, the TTF that supports a big character repertoire, doesn't like it. If I look at the source of the HTML page from the browser, it says î. But when it is accompanied by the ă, then it shows correctly.

If I wanted to make a workaround to change \xEE back to \303\256 (and all the rest), I cannot, at that point, see the difference between the two î's. It is only in outputting, so it seems, that the isolated î is offered to the browser/STDOUT in iso-latin-1 form. Can you shed more light on this?
Re^5: Strange behaviour ODBC/Unicode in perl by ikegami (Patriarch) on Feb 05, 2008 at 00:44 UTC
Printing shouldn't cause any conversion.

    $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(259)); Dump($x); print $x' | od -b
    SV = PV(0x819f9e0) at 0x814cc6c
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x81698e0 "\304\203"\0 ---.
      CUR = 2                         \__ same
      LEN = 3                         /
    0000000 304 203 --------------'
    0000002

    $ perl -MDevel::Peek -MEncode -we'$x=encode("UTF-8", chr(238)); Dump($x); print $x' | od -b
    SV = PV(0x819f9e0) at 0x814cc6c
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x81698e0 "\303\256"\0 ---.
      CUR = 2                         \__ same
      LEN = 3                         /
    0000000 303 256 --------------'
    0000002

How do you know it's outputting \xEE? Are you using `:encoding()` on STDOUT? You shouldn't with this data. Are you using CGI's HTML generation methods (`print h1(text)`)? They do some encoding too.
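To see which combinations of data and output layer could produce a lone \xEE, the bytes that actually reach a filehandle can be captured with an in-memory file. This is a hedged sketch, not a diagnosis of the original script: it assumes that some step (for example a module between the database and `print`) has decoded the UTF-8 bytes into text, which the thread does not confirm. The helper `bytes_printed` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: print $string to an in-memory file opened with the
# given layer ('' for none) and return the bytes written, in octal.
sub bytes_printed {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open: $!";
    {
        no warnings 'utf8';   # silence "Wide character in print" for this demo
        print {$fh} $string;
    }
    close($fh);
    return join ' ', map { sprintf('%03o', ord($_)) } split(//, $buffer);
}

my $bytes = "\303\256";               # "î" as UTF-8 bytes (case 3 in the thread)
my $text  = decode('UTF-8', $bytes);  # "î" as decoded text, i.e. chr(238)

# Bytes printed without a layer pass through unchanged.
print bytes_printed($bytes, ''), "\n";                   # 303 256

# Decoded text without a layer: a lone "î" goes out as iso-latin-1 ...
print bytes_printed($text, ''), "\n";                    # 356

# ... but together with "ă" the whole string goes out as UTF-8.
print bytes_printed($text . chr(259), ''), "\n";         # 303 256 304 203

# Decoded text through an encoding layer is always UTF-8.
print bytes_printed($text, ':encoding(UTF-8)'), "\n";    # 303 256

# Already-encoded bytes through an encoding layer get encoded twice.
print bytes_printed($bytes, ':encoding(UTF-8)'), "\n";   # 303 203 302 256
```

The second and third calls reproduce the reported pattern (an isolated î becomes \xEE, while î next to ă comes out fine), which is why it is worth checking whether the value is still a byte string at the moment it is printed.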
Re^6: Strange behaviour ODBC/Unicode in perl by jpvdv (Initiate) on Feb 05, 2008 at 08:11 UTC
Re^7: Strange behaviour ODBC/Unicode in perl by ikegami (Patriarch) on Feb 05, 2008 at 08:44 UTC