Hello,
I work for a company that uses HTML::Mason as the framework for its web sites. We are working on implementing multi-lingual capabilities for our infrastructure, and this is our first foray into using non-ASCII characters. We use HTML::Mason's default HTML escaping (which uses HTML::Entities to encode special characters into HTML entities) when writing text to the browser. I've had a lot of problems with special characters (such as an e with an acute accent, which in UTF-8 is represented as two bytes: 0xC3 0xA9).

HTML::Entities uses a regular expression to do its substitution:
s/([^\n\r\t !\#\$%\'-;=?-~])/$char2entity{$1} || num_entity($1)/ge
If I change that regex to include a utf8 character in the pattern (which according to the "Important Caveats" of perldoc's perlunicode page makes the regex compiler recognize multi-byte characters), it works:
my $foo = "\x{263A}";
$$ref =~ s/([^\n\r\t !\#\$%\'-;=?-~]|$foo)/$char2entity{$1} || num_entity($1)/ge;
In all the tests I've done (perl v5.6.1 and v5.8.1), the first regular expression only ever matches the first byte of the character rather than both bytes. I hate patching stock modules like this because they become very hard to maintain.
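To make the symptom concrete, here is a minimal sketch of the byte-versus-character difference. It assumes the Encode module that ships with Perl 5.8 (so it would not help on 5.6.1), and `encode_sketch` and `num_entity_sketch` are hypothetical stand-ins I made up for illustration, not the real HTML::Entities internals. It only suggests that the unmodified pattern matches once per character when the input has been decoded from UTF-8 bytes first:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the numeric fallback: always emits a hex entity.
sub num_entity_sketch { sprintf "&#x%X;", ord $_[0] }

# Hypothetical helper applying the same character class as the stock regex.
sub encode_sketch {
    my ($text) = @_;
    $text =~ s/([^\n\r\t !\#\$%\'-;=?-~])/num_entity_sketch($1)/ge;
    return $text;
}

# Raw UTF-8 bytes for "caf" followed by e-acute (0xC3 0xA9).
my $bytes = "caf\xC3\xA9";

# Treated as bytes: the class matches each of the two bytes separately.
my $from_bytes = encode_sketch($bytes);

# Decoded to characters first: the class matches the single character U+00E9.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = encode_sketch($chars);

print "$from_bytes\n";   # caf&#xC3;&#xA9;
print "$from_chars\n";   # caf&#xE9;
```

If this holds in the real setup, decoding the data at the point where it enters the application might avoid patching the module at all, though I have not verified that against Mason's escaping path.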

Does anyone know if there's any other way to get around this limitation? If possible, I'd rather not pass in a list of explicit characters to encode but so far that's the only thing I've come up with.

I know this has been discussed on this site before (in "HTML::Entities and UTF-8", "strange behavior with HTML::Entities", and "HTML::Entities question"), but I thought the question was worth posing again to see if anyone had any more input.

Thanks for any comments you might have.

In reply to HTML::Entities and multi-byte characters by bpphillips
