saintmike has asked for the wisdom of the Perl Monks concerning the following question:

I've looked all over CPAN, but I can't find a module that offers a standard way of performing this simple transformation.

To transform an XML entity like

&#x00FC;
into the corresponding utf-8-encoded Unicode character, the following substitution can be used, given that the string this is performed on is a Unicode string:
s/&#x([0-9a-f]+);/chr(hex($1))/ige
Now, instead of having to write this snippet down over and over again, I'd prefer something like
use URI::Escape;
$str = uri_unescape($safe);
which I use all the time, not because URL-unescaping is terribly complicated, but because for such a common operation there ought to be a standard procedure.
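For illustration, the substitution above could be wrapped in a small helper. This is only a sketch: the name decode_numeric_entities is hypothetical and does not come from any CPAN module, and it handles numeric character references only (hexadecimal and decimal), not named entities such as &amp;. Both forms are matched in a single pass so that already-decoded text is not decoded a second time.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any CPAN module.
# Decodes numeric character references only:
#   &#xFC;  (hexadecimal)  and  &#252;  (decimal)
sub decode_numeric_entities {
    my ($str) = @_;

    # One pass over the string: $1 is set for the hex form, $2 for the
    # decimal form. The replacement text is not rescanned, so something
    # like "&#x26;#252;" becomes "&#252;" and is not decoded twice.
    $str =~ s/&#(?:x([0-9a-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/gie;

    return $str;
}

my $decoded = decode_numeric_entities('M&#x00FC;nchen costs 5 &#8364;');

# Compare against the expected characters instead of printing them,
# which avoids having to set an output encoding for this sketch.
print "decoded as expected\n" if $decoded eq "M\x{FC}nchen costs 5 \x{20AC}";
```

As in the original substitution, the result is a Perl character string; it still has to be encoded (for example with an :encoding(UTF-8) layer) before it is written out.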

So ... is there a module on CPAN that does something similar? If not, I'll be happy to put one up there.

By the way, XML::DOM provides a function called XmlUtf8Encode which does a lot more than calling chr(), but I guess that's because it tries to cope with older perl releases that didn't support Unicode well. Any insight on this would be appreciated as well.

(Hex-entity corrected, thanks eserte.)

Re: Decode XML &#xxxx; entities
by eserte (Deacon) on Dec 04, 2007 at 20:12 UTC
    You seem to confuse HTML entities with URI escaping. You should use HTML::Entities, not URI::Escape. Also, an HTML entity never looks like &#00FC;. It's hexadecimal, so there must be an "x" in between, or keep it decimal. With HTML::Entities I get the expected result:
    use HTML::Entities qw(decode_entities);
    use Devel::Peek;
    Dump decode_entities "&#x00FC;";
    Dump decode_entities "&#x20AC;";
    __END__
    SV = PV(0x5060c8) at 0x5051e8
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      PV = 0x510920 "\374"\0
      CUR = 1
      LEN = 16
    SV = PV(0x5060c8) at 0x5051e8
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x510920 "\342\202\254"\0 [UTF8 "\x{20ac}"]
      CUR = 3
      LEN = 16
    Note that the first result does not have the utf-8 flag on, but for Perl it does not matter whether a codepoint < 256 is internally encoded as latin1 or utf8.
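The point about the utf-8 flag can be checked with a short sketch. It assumes nothing beyond core Perl: utf8::upgrade and utf8::is_utf8 are built-in functions, and the variable names are only illustrative. The same character is stored once as a single latin1 byte and once in Perl's internal UTF-8 form, and the two strings still compare equal.

```perl
use strict;
use warnings;

my $latin1   = "\x{FC}";   # codepoint < 256: stored internally as one byte
my $upgraded = "\x{FC}";
utf8::upgrade($upgraded);  # same character, now stored internally as UTF-8

print "latin1 copy:   flag ", (utf8::is_utf8($latin1)   ? "on" : "off"), "\n";
print "upgraded copy: flag ", (utf8::is_utf8($upgraded) ? "on" : "off"), "\n";

# Character semantics are identical regardless of the internal storage.
print "strings are equal\n" if $latin1 eq $upgraded;
print "length is ", length($upgraded), "\n";
```

The flag only describes the internal representation, so code that works on characters should not need to inspect it.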
      You seem to confuse HTML entities with URI escaping.

      Nope, I was just giving an example of a similarly trivial transformation that's covered by a CPAN module.

      Also, an HTML entity never looks like &#00FC;. It's hexadecimal, so there must be an "x" in between, or keep it decimal.

      Thanks, corrected in my original post.

      With HTML::Entities I get the expected result:

      Looks pretty good!

Re: Decode XML &#xxxx; entities
by moritz (Cardinal) on Dec 04, 2007 at 18:20 UTC
    I didn't test it, but I'd expect the XML::Entities::decode function to do exactly that.

    If not, that's the right module to extend.

      Not sure if I want to use a module which fails to run the code in its own SYNOPSIS section.
        I agree that it looks unreliable.

        Maybe you should write a bug report, or contact the author - it's a very young module.

        I'm a bit surprised that HTML::Entities can't handle it either (at least that's what a very shallow test of mine showed).

        I searched CPAN a bit for XML entities and found nothing really good.

        So maybe your contributions are really needed ;-)