HTML::Entities and multi-byte characters

bpphillips has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I work for a company that uses HTML::Mason as a framework for its web sites. We are working on implementing multi-lingual capabilities for our infrastructure, and this is our first foray into using non-ASCII characters. We use HTML::Mason's default HTML escaping (which uses HTML::Entities to encode special characters into HTML entities) when writing text to the browser. I've had a lot of problems with special characters (such as an e with an acute accent, which in UTF-8 is represented as two bytes: 0xC3 0xA9).

HTML::Entities uses a regular expression to do its substitution:

s/([^\n\r\t !\#\$%\'-;=?-~])/$char2entity{$1} || num_entity($1)/ge

If I change that regex to include a utf8 character in the pattern (which according to the "Important Caveats" of perldoc's perlunicode page makes the regex compiler recognize multi-byte characters), it works:

my $foo = "\x{263A}";
$$ref =~ s/([^\n\r\t !\#\$%\'-;=?-~]|$foo)/$char2entity{$1} || num_entity($1)/ge;

In all the tests I've done (perl v5.6.1 and v5.8.1), the first regular expression only ever matches the first byte of the character rather than both bytes. I hate patching stock modules like this because they become very hard to maintain.
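
The byte-versus-character behaviour described above can be reproduced without HTML::Entities at all. The following is only a sketch, assuming Perl 5.8 or later with the core Encode module; the `count_unsafe` helper is a made-up name for illustration, and the character class is copied from the pattern quoted earlier.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Character class copied from the substitution quoted above.
my $unsafe = qr/[^\n\r\t !\#\$%\'-;=?-~]/;

# Hypothetical helper: count how many separate matches the class finds.
sub count_unsafe {
    my ($text) = @_;
    my $count = () = $text =~ /$unsafe/g;
    return $count;
}

my $bytes = "\xC3\xA9";               # e-acute as two raw UTF-8 bytes
my $chars = decode('UTF-8', $bytes);  # the same text as one character, U+00E9

printf "bytes: length %d, matches %d\n", length $bytes, count_unsafe($bytes);
printf "chars: length %d, matches %d\n", length $chars, count_unsafe($chars);
```

On a byte string the class matches each byte separately, which is the symptom reported here. Once the string has been decoded into characters, the same unpatched pattern matches the e-acute once.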

Does anyone know if there's any other way to get around this limitation? If possible, I'd rather not pass in a list of explicit characters to encode but so far that's the only thing I've come up with.

I know this has been discussed on this site before (HTML::Entities and UTF-8, strange behavior with HTML::Entities and HTML::Entities question) but I thought the question was worth posing again to see if anyone had any more input.

Thanks for any comments you might have


Replies are listed 'Best First'.
Re: HTML::Entities and multi-byte characters by iburrell (Chaplain) on Sep 13, 2004 at 19:48 UTC
My impression is that you will always have problems with Unicode strings and Perl 5.6. In my experience, HTML::Entities works just fine under Perl 5.8 with Unicode strings. The strings must be marked as Unicode, not just contain UTF-8 bytes. If the source isn't doing the conversion, you can do it manually with Encode. `print encode_entities("a\x9B\x{263A}");`
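
To illustrate the manual conversion with Encode, here is a rough sketch that assumes Perl 5.8 or later with HTML::Entities installed. The sample string is invented for the example; it stands in for UTF-8 bytes arriving from an external source.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Made-up sample: "cafe" with an e-acute, supplied as raw UTF-8 bytes.
my $bytes = "caf\xC3\xA9";

# Turn the bytes into a character string before entity-encoding.
my $chars = decode('UTF-8', $bytes);

my $from_bytes = encode_entities($bytes);  # each byte is encoded on its own
my $from_chars = encode_entities($chars);  # e-acute is encoded as one character

print "$from_bytes\n";
print "$from_chars\n";
```

Without the decode step the two bytes are treated as two separate Latin-1 characters, so the output contains two entities instead of one.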
Re^2: HTML::Entities and multi-byte characters by bpphillips (Friar) on Sep 13, 2004 at 20:35 UTC
Thanks for the tips. It does seem that 5.8 is much better at handling Unicode strings. Doing `encode_entities("a \x{9B} \x{263A}")` in 5.6 yields: `a Â âº` In 5.8 it yields: `a ☺` which is what it should be. However, the string coming from the database (MySQL) still doesn't print correctly. I wasn't familiar with the Encode module that you mentioned, but when I do a Dump (using Devel::Peek) on the string I pull from the database, I can see that it doesn't have the UTF8 flag that the string I create manually does. I tried `my $str = decode_utf8($data);`, which worked splendidly and did exactly what I wanted. Do you know if this is SOP when working with MySQL? (i.e. will I have to do this on every string that I pull from the database?)
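
For anyone who wants to check the flag without reading a full Devel::Peek dump, a small sketch along these lines may help. It assumes Perl 5.8.1 or later; `$data` here is a hard-coded stand-in for a value fetched from the database, not real driver output.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for a database value: U+263A stored as three UTF-8 bytes.
my $data = "\xE2\x98\xBA";

# Decode the bytes into a character string, as in the post above.
my $str = decode_utf8($data);

for my $pair ([raw => $data], [decoded => $str]) {
    my ($label, $value) = @$pair;
    printf "%-8s flag=%s length=%d\n",
        $label,
        (utf8::is_utf8($value) ? 'on' : 'off'),   # internal UTF8 flag
        length $value;
}
```

The raw value reports three units with the flag off, while the decoded value reports a single character with the flag on. The flag is an internal detail, so this is better treated as a debugging aid than as application logic.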
Re^3: HTML::Entities and multi-byte characters by iburrell (Chaplain) on Sep 13, 2004 at 22:07 UTC
You probably will have to make a Unicode string from strings that come from the database. Some drivers (DBD::Pg) will flag strings as Unicode. I don't know if DBD::mysql supports this. I have seen three different ways to control the encoding of strings. DBD::Pg has a dbh property, DBD::Oracle uses the NLS_LANG environment variable, and some use the database encoding. Unfortunately, it is not something that is well documented.
Re^4: HTML::Entities and multi-byte characters by bpphillips (Friar) on Sep 14, 2004 at 14:31 UTC