
Re: HTML::Entities and multi-byte characters

by iburrell (Chaplain)
on Sep 13, 2004 at 19:48 UTC

in reply to HTML::Entities and multi-byte characters

My impression is that you will always have problems with Unicode strings and Perl 5.6.

In my experience, HTML::Entities works just fine under Perl 5.8 with Unicode strings. The strings must be marked as Unicode, not just contain UTF-8 bytes. If the source isn't doing the conversion, you can do it manually with Encode.

print encode_entities("a\x9B\x{263A}");
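
To illustrate the difference between UTF-8 bytes and a string that Perl has marked as Unicode, here is a minimal sketch for Perl 5.8 or later. It assumes the incoming data is valid UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes for U+263A, as an undecoded source might hand them over.
my $bytes = "\xE2\x98\xBA";

# Decode the bytes into a Perl character string (this sets the UTF8 flag).
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, flagged %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'yes' : 'no';
printf "chars: length %d, flagged %s\n",
    length($chars), utf8::is_utf8($chars) ? 'yes' : 'no';

# The decoded string is one character, so it becomes one numeric entity.
my $encoded = encode_entities($chars);
print "$encoded\n";    # prints &#x263A;
```

Passing $bytes straight to encode_entities would instead encode each of the three bytes separately, which is the usual symptom of a string that was never decoded.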

Re^2: HTML::Entities and multi-byte characters
by bpphillips (Friar) on Sep 13, 2004 at 20:35 UTC
    Thanks for the tips. It does seem that 5.8 is *much* better at handling Unicode strings. Doing encode_entities("a \x{9B} \x{263A}") in 5.6 yields:
    a › ☺
    In 5.8 it yields:
    a › ☺
    which is what it should be.

    However, the string coming from the database (MySQL) still doesn't print correctly. I wasn't familiar with the Encode module that you mentioned, but when I do a Dump (using Devel::Peek) on the string I pull from the database, I can see that it doesn't have the UTF8 flag that the string I create manually does. I tried doing a:
    my $str = decode_utf8($data);
    which worked splendidly and did exactly what I wanted it to. Do you know if this is SOP when working with MySQL? (i.e. will I have to do this on any string that I pull from the database?)
      You probably will have to make a Unicode string from strings that come from the database.

      Some drivers (DBD::Pg) will flag strings as Unicode. I don't know if DBD::mysql supports this. I have seen three different ways to control the encoding of strings. DBD::Pg has a dbh property, DBD::Oracle uses the NLS_LANG environment variable, and some use the database encoding. Unfortunately, it is not something that is well documented.

        I did a bit of googling and discovered that DBD::mysql doesn't support this but I found there's some ongoing discussion of how it should be emulated: Google Groups Thread. We use our own simple DBH abstraction layer so I might just add functionality at that level to do the decode_utf8() conversion...
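
One way the decode step could live in a DBH abstraction layer is sketched below. This is only a sketch under stated assumptions: decode_row is a hypothetical helper, not part of DBI or DBD::mysql, the column names are made up, and it assumes every text column holds valid UTF-8 (binary columns would need to be skipped).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper for a DBH wrapper: returns a copy of the row
# with every plain byte string decoded from UTF-8.
sub decode_row {
    my ($row) = @_;
    my %decoded;
    for my $col (keys %$row) {
        my $val = $row->{$col};
        # Leave NULLs (undef), references, and already-flagged strings alone.
        if (defined $val && !ref $val && !utf8::is_utf8($val)) {
            $val = decode('UTF-8', $val);
        }
        $decoded{$col} = $val;
    }
    return \%decoded;
}

# Stand-in for a row as returned by fetchrow_hashref.
my $row   = { id => 42, name => "caf\xC3\xA9", note => undef };
my $clean = decode_row($row);

printf "name has %d characters\n", length $clean->{name};
```

Skipping values that already carry the UTF8 flag is a design choice to avoid decoding twice if a driver later starts flagging strings itself.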