Re^2: HTML::Entities and multi-byte characters

Thanks for the tips. It does seem that 5.8 is *much* better at handling Unicode strings. Doing encode_entities("a \x{9B} \x{263A}") in 5.6 yields:

a &Acirc;&#155; &acirc;&#152;&ordm;

In 5.8 it yields:

a &#155; &#x263A;

which is what it should be.

However, the string coming from the database (MySQL) still doesn't print correctly. I wasn't familiar with the Encode module that you mentioned, but when I do a Dump (using Devel::Peek) on the string I pull from the database, I can see that it doesn't have the UTF8 flag that the string I create manually does. I tried doing a:

my $str = decode_utf8($data);

which worked splendidly and did exactly what I wanted. Do you know if this is SOP when working with MySQL? (i.e., will I have to do this for every string that I pull from the database?)
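For anyone following along, here is a minimal sketch of what decode_utf8() changes. It uses only the core Encode module; the $data value is a hand-written stand-in for what the driver returns (the UTF-8 bytes of U+263A with no UTF8 flag), not the result of a real query.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Assumption: stand-in for a value fetched from MySQL. These are the raw
# UTF-8 bytes for "a ", then U+263A, and the UTF8 flag is off.
my $data = "a \xE2\x98\xBA";

# Interpret the bytes as UTF-8 and get back a character string.
my $str = decode_utf8($data);

# Before decoding, Perl counts five bytes; afterwards, three characters.
printf "bytes: length=%d utf8_flag=%d\n",
    length($data), utf8::is_utf8($data) ? 1 : 0;
printf "chars: length=%d utf8_flag=%d\n",
    length($str), utf8::is_utf8($str) ? 1 : 0;
```

Once the string is a character string, U+263A is a single character rather than three separate bytes, which is why the entities come out right.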

Re^3: HTML::Entities and multi-byte characters by iburrell (Chaplain) on Sep 13, 2004 at 22:07 UTC
You will probably have to make a Unicode string from strings that come from the database. Some drivers (DBD::Pg) will flag strings as Unicode. I don't know if DBD::mysql supports this. I have seen three different ways to control the encoding of strings: DBD::Pg has a dbh property, DBD::Oracle uses the NLS_LANG environment variable, and some drivers use the database encoding. Unfortunately, this is not well documented.
Re^4: HTML::Entities and multi-byte characters by bpphillips (Friar) on Sep 14, 2004 at 14:31 UTC
I did a bit of googling and discovered that DBD::mysql doesn't support this, but I found there's some ongoing discussion of how it should be emulated: Google Groups Thread. We use our own simple DBH abstraction layer, so I might just add functionality at that level to do the decode_utf8() conversion...
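A rough sketch of what that layer-level conversion might look like. The decode_row() helper is hypothetical (it is not part of DBI or DBD::mysql), and it assumes every text column holds UTF-8; binary columns such as BLOBs would need to be skipped.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper for a DBH wrapper: returns a copy of a fetched row
# in which every defined, not-yet-decoded value is decoded from UTF-8.
sub decode_row {
    my ($row) = @_;    # array ref, e.g. from $sth->fetchrow_arrayref
    return [
        map { defined $_ && !utf8::is_utf8($_) ? decode_utf8($_) : $_ } @$row
    ];
}

# Stand-in row: UTF-8 bytes for "café", a NULL column, and a number.
my $row = decode_row([ "caf\xC3\xA9", undef, 42 ]);
printf "length=%d\n", length($row->[0]);
```

Skipping values that already carry the UTF8 flag avoids decoding twice if a driver later starts flagging strings itself.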