in reply to XML::Simple and problem characters

The &foo; syntax has special meaning in XML. However, unlike HTML, XML does NOT predefine a bunch of named entities. So, if the entity has not been declared first, the XML parser won't know what it is. The only exceptions are the five predefined entities &gt; &lt; &amp; &quot; &apos; and numeric character references in the form of &#xA; etc.

If you want to preserve the entity verbatim, you probably want to do this:

$xml =~ s/&(?!(?:gt|lt|amp|quot|apos|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;

Basically, you are properly escaping every & that should not appear bare in the XML. So &simple; would become &amp;simple; (boy this is hard to type in HTML).
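
To make the effect concrete, here is a minimal sketch of that substitution wrapped in a helper and run on a made-up string. The name escape_stray_ampersands and the sample input are my own assumptions, not part of the original question, and this variant also lets decimal and hex numeric references through.

```perl
use strict;
use warnings;

# Escape every '&' that does not start one of the five predefined
# entities or a numeric character reference (decimal or hex).
# The helper name is hypothetical, chosen for illustration only.
sub escape_stray_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:gt|lt|amp|quot|apos|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;
    return $xml;
}

# Hypothetical sample input, not taken from the original post.
my $sample  = '<a>&simple; &amp; &#xA; AT&T</a>';
my $escaped = escape_stray_ampersands($sample);

print "$escaped\n";   # prints <a>&amp;simple; &amp; &#xA; AT&amp;T</a>
```

Because the lookahead is zero-width, only the bare & is replaced, so text that is already escaped is not escaped a second time.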

Sorry, the code is untested, but should work ;-)

Update: Added the ; in the regex to make it a little more correct.

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Re^2: Simply Choking
by ChemBoy (Priest) on Feb 08, 2005 at 21:38 UTC

    I'd give it a little extra tweak:

    my %entities = ( special => 1, case => 1, );
    $xml =~ s/&(?=(\w+);)/exists $entities{$1} ? "&amp;" : die "Unknown entity '$1'"/ge;

    This gives you a fair amount of flexibility in defining your entities, but still leaves you with some way of catching bad ones if they get in. Of course, you could also just make the right side of the conditional leave the string untouched and let XML::Simple catch the error, but this seems unkind, if admittedly kind of cute.
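
    One thing to watch with the whitelist idea: as written, the pattern also matches the predefined names (amp, lt and so on), so an already-escaped &amp; would hit the die branch unless those names are in the hash. A rough sketch of one way around that, assuming a second hash %predefined and a helper name escape_known_entities that I made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical whitelist of custom entity names to keep verbatim.
my %entities = ( special => 1, case => 1 );

# The five entities XML predefines; these are passed through untouched.
my %predefined = map { $_ => 1 } qw(gt lt amp quot apos);

sub escape_known_entities {
    my ($xml) = @_;
    # Only the '&' is consumed; the lookahead captures the name in $1.
    $xml =~ s{&(?=(\w+);)}{
          $predefined{$1} ? '&'        # leave predefined entities alone
        : $entities{$1}   ? '&amp;'    # escape whitelisted custom entities
        :                   die "Unknown entity '$1'\n"
    }gex;
    return $xml;
}

print escape_known_entities('<a>&special; &lt; &amp;</a>'), "\n";
# prints <a>&amp;special; &lt; &amp;</a>

# An entity outside both hashes is reported instead of passed on.
my $ok = eval { escape_known_entities('<a>&bogus;</a>'); 1 };
print "died: $@" unless $ok;
```

    Numeric references such as &#xA; never match \w+, so they pass through without being looked up at all.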



    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders

Re^2: Simply Choking
by Grundle (Scribe) on Feb 08, 2005 at 19:52 UTC
    That's some Perl-FU! Thanks, that is exactly what I needed. I never even thought of regex-ing my way around it.