Decode XML &#xxxx; entities

saintmike has asked for the wisdom of the Perl Monks concerning the following question:

I've looked all over CPAN, but I can't find a module that offers a standard way of performing this simple transformation.

To transform an XML entity like

    &#x00FC;

into the corresponding utf-8-encoded Unicode character, the following substitution can be used, provided the string it is applied to is a Unicode string:

    s/&#x([0-9a-f]+);/chr(hex($1))/ige
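
For illustration, here is a minimal, self-contained sketch of that substitution at work. The sample string and variable names are made up for this example; the point is that the result is a character string, which still has to be encoded explicitly (here with the core Encode module) before it becomes UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample input (invented for this sketch): a character string containing
# two hexadecimal character references.
my $str = 'Z&#x00FC;rich costs 5 &#x20AC;';

# The substitution from above: replace each &#x...; with its character.
(my $decoded = $str) =~ s/&#x([0-9a-f]+);/chr(hex($1))/ige;

# $decoded holds characters, not bytes; encode explicitly for output.
my $bytes = encode('UTF-8', $decoded);

# Prints "16 characters, 19 bytes"
printf "%d characters, %d bytes\n", length($decoded), length($bytes);
```

The byte count is larger than the character count because U+00FC takes two bytes and U+20AC takes three bytes in UTF-8.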

Now, instead of having to write this snippet down over and over again, I'd prefer something like

    use URI::Escape;
    $str  = uri_unescape($safe);

which I use all the time, not because URL-unescaping is terribly complicated, but because for such a common operation there ought to be a standard procedure.
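
To make the idea concrete, here is a rough sketch of what such a helper could look like. The name `decode_numeric_entities` is hypothetical and does not come from any CPAN module; as an assumption beyond the snippet above, it also accepts decimal references such as `&#252;`, and it deliberately leaves named entities like `&amp;` alone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any CPAN module: decodes numeric
# character references in hexadecimal (&#xFC;) or decimal (&#252;) form.
sub decode_numeric_entities {
    my ($text) = @_;
    # $1 is set for the hex form, $2 for the decimal form.
    $text =~ s/&#(?:x([0-9a-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/gie;
    return $text;
}

# Usage, in the same spirit as uri_unescape():
my $str = decode_numeric_entities('&#xFC; and &#252; and &amp;');
```

Both numeric forms come back as the single character U+00FC, while `&amp;` passes through unchanged.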

So ... is there a module on CPAN that does something similar? If not, I'll be happy to put one up there.

By the way, XML::DOM provides a function called XmlUtf8Encode which does a lot more than calling chr(), but I guess that's because it tries to cope with older perl releases that didn't support Unicode well. Any insight on this would be appreciated as well.

(Hex-entity corrected, thanks eserte.)

Replies are listed 'Best First'.
Re: Decode XML &#xxxx; entities by eserte (Deacon) on Dec 04, 2007 at 20:12 UTC
You seem to confuse HTML entities with URI escaping. You should use HTML::Entities, not URI::Escape. Also, an HTML entity never looks like `&#00FC;`. It's hexadecimal, so there must be an "x" in between, or keep it decimal. With HTML::Entities I get the expected result:

    use HTML::Entities qw(decode_entities);
    use Devel::Peek;
    Dump decode_entities "&#x00FC;";
    Dump decode_entities "&#x20AC;";
    __END__
    SV = PV(0x5060c8) at 0x5051e8
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK)
      PV = 0x510920 "\374"\0
      CUR = 1
      LEN = 16
    SV = PV(0x5060c8) at 0x5051e8
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x510920 "\342\202\254"\0 [UTF8 "\x{20ac}"]
      CUR = 3
      LEN = 16

Note that the first result does not have the utf-8 flag on, but for Perl it does not matter whether a codepoint < 256 is internally encoded as latin1 or utf8.
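
The point about the utf-8 flag can be checked with core Perl alone. The following is only a sketch with made-up variable names: it stores the same character U+00FC once in Perl's internal latin1 form and once in its internal utf8 form (via `utf8::upgrade`), then compares the two strings.

```perl
use strict;
use warnings;

my $latin1   = chr(0xFC);   # stored internally as a single byte
my $upgraded = chr(0xFC);
utf8::upgrade($upgraded);   # same character, now stored as utf8 internally

# Prints "flag off", "flag on", "equal"
print utf8::is_utf8($latin1)   ? "flag on\n" : "flag off\n";
print utf8::is_utf8($upgraded) ? "flag on\n" : "flag off\n";
print $latin1 eq $upgraded     ? "equal\n"   : "different\n";
```

The internal representation differs, but string comparison and `length` work on characters, so both strings behave the same.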
Re^2: Decode XML &#xxxx; entities by saintmike (Vicar) on Dec 04, 2007 at 21:07 UTC
"You seem to confuse HTML entities with URI escaping."

Nope, I was just giving an example of a similarly trivial transformation that's covered by a CPAN module.

"Also, an HTML entity never looks like &#00FC;. It's hexadecimal, so there must be an "x" in between, or keep it decimal."

Thanks, corrected in my original post.

"With HTML::Entities I get the expected result:"

Looks pretty good!
Re: Decode XML &#xxxx; entities by moritz (Cardinal) on Dec 04, 2007 at 18:20 UTC
I didn't test it, but I'd expect the `decode` function of XML::Entities to do exactly that. If not, that's the right module to extend.
Re^2: Decode XML &#xxxx; entities by saintmike (Vicar) on Dec 04, 2007 at 18:33 UTC
Not sure if I want to use a module which fails to run the code in its own SYNOPSIS section.
Re^3: Decode XML &#xxxx; entities by moritz (Cardinal) on Dec 04, 2007 at 18:51 UTC
I agree that it looks unreliable. Maybe you should write a bug report, or contact the author; it's a very young module. I'm a bit surprised that HTML::Entities can't handle it either (at least that's what a very shallow test of mine showed). I searched CPAN a bit for XML entities and found nothing really good. So maybe your contributions are really needed ;-)