Grundle has asked for the wisdom of the Perl Monks concerning the following question:

I am using XML::Simple to parse some XML data. I like it because it can parse general XML data without looking at a DTD file. One problem I keep running into, however, is data containing something like the following.
$data = "<some_tag> blah blah blah &special; blah blah </some_tag>";
The &special; is causing XML::Simple to choke. When I do the following
$data =~ s/&(.*);//g;
I have no problems parsing my XML tree. Is there any way to force XML::Simple to ignore those types of characters? I noticed that there is an option labelled "NoEscape" that deals with those types of characters, but it is only for the XMLout method.

Any ideas?

Re: Simply Choking
by TedYoung (Deacon) on Feb 08, 2005 at 19:44 UTC

    The &foo; syntax has special meaning in XML. However, unlike HTML, XML does NOT predefine a bunch of named entities. So, if the entity has not been declared first, the XML parser won't know what it is. The only exceptions are the five predefined entities &gt; &lt; &amp; &quot; &apos; and numeric character references such as &#x0A; (hexadecimal) or &#10; (decimal).

    If you want to preserve the entity verbatim, you probably want to do this:

    $xml =~ s/&(?!(?:gt|lt|amp|quot|apos|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;

    Basically, you are escaping every & that does not begin an entity the parser already knows. So &special; would become &amp;special; (boy this is hard to type in HTML).

    Sorry, the code is untested, but should work ;-)

    Update: Added the ; in the regex to make it a little more correct.
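
    For anyone who wants to try this out, here is a minimal sketch of the same substitution wrapped in a sub, with the character references spelled out so that both the decimal and the hexadecimal forms pass through. The sub name escape_unknown_entities and the sample string are made up for illustration; they are not part of XML::Simple.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of XML::Simple): escape every '&' that
    # does not start a predefined entity or a numeric character reference.
    sub escape_unknown_entities {
        my ($xml) = @_;
        $xml =~ s/&(?!(?:gt|lt|amp|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)/&amp;/g;
        return $xml;
    }

    # Only &special; is rewritten; &lt;, &#x0A; and &#10; are left alone.
    my $data = '<some_tag> blah &special; blah &lt; &#x0A; &#10; </some_tag>';
    print escape_unknown_entities($data), "\n";
    ```

    The escaped string should then be safe to hand to the parser, and the text of some_tag will contain the literal characters &special; rather than an expanded entity.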

    Ted Young

    ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

      I'd give it a little extra tweak:

      my %entities = ( special => 1, case => 1, );
      $xml =~ s/&(?=(\w+);)/exists $entities{$1} ? "&amp;" : die "Unknown entity '$1'"/ge;

      This gives you a fair amount of flexibility in defining your entities, but still leaves you with some way of catching bad ones if they get in. Of course, you could also make the right side of the conditional simply leave the string untouched, and let XML::Simple catch the error, but this seems unkind, if admittedly kind of cute.
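
      A slightly longer sketch of the same idea, in case it helps. It assumes you also want the five predefined entities to pass through untouched rather than trip the die; the %predefined hash and the sub name escape_known_entities are my own additions for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical whitelist of undeclared entity names we expect to see.
      my %entities   = ( special => 1, case => 1 );
      # The five entities XML predefines; these must pass through unchanged.
      my %predefined = map { $_ => 1 } qw(gt lt amp quot apos);

      sub escape_known_entities {
          my ($xml) = @_;
          # The lookahead captures the name without consuming it, so only
          # the '&' itself is replaced.
          $xml =~ s{&(?=(\w+);)}{
              $predefined{$1}      ? '&'
            : exists $entities{$1} ? '&amp;'
            : die "Unknown entity '$1'\n"
          }ge;
          return $xml;
      }

      print escape_known_entities('<t>&special; &lt; &#10;</t>'), "\n";

      # An entity outside both lists is reported instead of slipping through.
      my $ok = eval { escape_known_entities('<t>&bogus;</t>'); 1 };
      print "rejected: $@" unless $ok;
      ```

      Numeric references such as &#10; are never matched here, because \w+ cannot match the # character.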



      If God had meant us to fly, he would *never* have given us the railroads.
          --Michael Flanders

      That's some Perl-FU! Thanks, that is exactly what I needed. I never even thought of regex-ing my way around it.
Re: Simply Choking
by Roy Johnson (Monsignor) on Feb 08, 2005 at 20:18 UTC
    Note that your use of dot-star is greedy, and will absorb multiple amp-codes (and whatever is between them) on one line.
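
    A small demonstration of the difference, using a made-up line with two amp-codes on it. The \w+ variant is just one possible narrower pattern, not the only fix.

    ```perl
    use strict;
    use warnings;

    my $data = '<some_tag> a &one; b &two; c </some_tag>';

    # Greedy: .* runs to the LAST ';' on the line, so " b " is lost too.
    (my $greedy = $data) =~ s/&(.*);//g;

    # Narrower: \w+ cannot cross a ';', so each amp-code is removed on its own.
    (my $narrow = $data) =~ s/&\w+;//g;

    print "greedy: $greedy\n";
    print "narrow: $narrow\n";
    ```

    In the greedy result the b between the two codes disappears along with them; in the narrow result it survives.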

    Caution: Contents may have been coded under pressure.