artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am using XML::Simple and trying to parse a file. Here is my sample code.
    use XML::Simple;
    use strict;
    use warnings;

    $| += 1;
    opendir TMP, '/u1/tmptest';
    my @files = grep /xml$/, grep { -r } readdir TMP;
    chdir '/u1/tmptest';
    my $flag = 0;
    foreach my $file (@files) {
        my $tmp_xml = XMLin("./$file");
        process($tmp_xml);
    }

    sub process {
        my $xml   = shift;
        my $title = $xml->{head}->{title};
        ....
    }
It gives me this error:
Character reference & #133; refers to an illegal XML character (\205) 

When I checked the particular file, it does contain the character reference & #133;.

What I would like to do is either correct the problem and proceed, or ignore the file. I don't have control over the source files. If I skip this file, I get the same error for some other, similar character in another file, e.g. & #146;

In the CB, podmaster pointed out that XML::Simple, which is based on XML::Parser, dies on invalid XML. I would like to find some solution at this point.

Thanks,
artist

(Note: In the above examples, & and # should not have a space between them. It's just that PM tries to interpret the character.)

Replies are listed 'Best First'.
Re: XML::Simple exit problem
by mirod (Canon) on Dec 18, 2003 at 22:03 UTC

    The standard way to avoid dying is to wrap the call to XMLin within an eval block:

        eval { $tmp_xml = XMLin("./$file"); };
        if ($@) {
            # something bad happened
            print "error in $file: $@\n";
        }
        else {
            process($tmp_xml);
        }

    Just a few remarks: XML::Simple is not necessarily based on XML::Parser. If it can find a SAX parser around (XML::SAX::PurePerl or XML::LibXML) it will use it. And the above code does not work on my machine, as the parser does not seem to die when wrapped in an eval block (but dies properly when not wrapped...).

    The proper way to do this is probably to have a separate step, where you run the XML through a simple XML checker (xmlwf comes with expat and xmllint with libxml2) before running your process, so that your process only ever sees well-formed XML.
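
    A minimal sketch of such a separate checking step, assuming xmllint is installed and on the PATH. The helper name is_well_formed is made up for illustration; the only external behaviour relied on is that "xmllint --noout FILE" exits with status 0 when the file parses and non-zero otherwise.

```perl
use strict;
use warnings;

# Returns true when the external checker exits with status 0 for $file.
# The default command, "xmllint --noout", is an assumption about what is
# installed; pass a different command list to override it.
sub is_well_formed {
    my ($file, @checker) = @_;
    @checker = ('xmllint', '--noout') unless @checker;
    my $status = system(@checker, $file);   # list form: no shell involved
    return $status == 0;
}

# Usage: keep only the files named on the command line that pass the check.
my @good = grep { is_well_formed($_) } @ARGV;
print "well-formed: $_\n" for @good;
```

    Only the files in @good would then be handed to XMLin; the others can be logged and skipped.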

Re: XML::Simple exit problem
by neuroball (Pilgrim) on Dec 19, 2003 at 04:41 UTC
    artist,

    Your problem seems to be that you have HTML-encoded characters in your XML feed.

    The problem is that the only named entities XML predefines are & (&amp;), < (&lt;), > (&gt;), ' (&apos;) and " (&quot;). Everything else has to be written as a numeric reference to a Unicode character (e.g. &#8026;). Btw., not all Unicode characters are permitted in XML.

    There is a way to get around all of this, though: get your data/characters encased in a CDATA section, where the parser takes the text literally.
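
    A small sketch of what encasing text in a CDATA section involves. The function name cdata_wrap is hypothetical. The one catch is that the text itself must not contain "]]>", so that sequence is split across two adjacent sections.

```perl
use strict;
use warnings;

# Wrap arbitrary text in a CDATA section. A literal "]]>" inside the text
# would end the section early, so it is split into "]]" and ">" that sit
# in two adjacent CDATA sections.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return "<![CDATA[$text]]>";
}

print cdata_wrap('Wait & see'), "\n";   # prints <![CDATA[Wait & see]]>
```

    Note that this only helps when whoever produces the XML does the wrapping, because the text has to be wrapped before it reaches the parser.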

    Now back to your problems and a few solutions

    1. It seems that you get invalid XML.
    2. To get around this you have two possibilities:
      • Go to your XML source and ask them to verify their XML before they give it to you.
      • Preparse the XML and use regular expressions to either replace the invalid character entities, or encase the data in CDATA tags.
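
    The preparse option could be sketched like this. It rests on a loud assumption that is not confirmed anywhere in this thread: that references such as 133 and 146 are Windows-1252 byte values (ellipsis and right single quote) rather than real Unicode code points. The names %CP1252, fixed_ref and fix_char_refs are made up for the example, and only decimal references are handled.

```perl
use strict;
use warnings;

# ASSUMPTION: numbers 128-159 in the source files are Windows-1252 byte
# values. This table maps each one to the Unicode code point it stands for.
my %CP1252 = (
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
);

# Decide what one decimal character reference should become.
sub fixed_ref {
    my $n = 0 + shift;                           # normalise e.g. "0133" to 133
    return "&#$n;" if $n < 128 || $n > 159;      # outside the range: keep as is
    return ''      unless exists $CP1252{$n};    # no cp1252 meaning: drop it
    return sprintf '&#x%X;', $CP1252{$n};        # re-emit as a Unicode reference
}

# Rewrite every decimal character reference in a string of XML.
sub fix_char_refs {
    my ($xml) = @_;
    $xml =~ s/&#(\d+);/fixed_ref($1)/ge;
    return $xml;
}

print fix_char_refs('<title>Wait&#133; it&#146;s here</title>'), "\n";
# prints <title>Wait&#x2026; it&#x2019;s here</title>
```

    The cleaned string can be passed straight to XMLin, which accepts a string of XML as well as a file name. Dropping the unmapped values silently is just one choice; logging them would be equally reasonable.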

    thanks
    /oliver/

    Update: removed 'either' from sentence without second option.