Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So, XML::Twig is a great module to use when chopping up a 500MB file without thrashing your machine. But I've come across a case where, midway through that 500MB file, I get a line of XML with a bad character. It is something like this:

<keywords>cat, dog, mouse ^\, car, plane, truck</keywords>

I'm parsing the file with this method:
$twig->parsefile( $filehandle );

When the bad character is discovered, the parsing abruptly stops. I've also tried the safe_parsefile method, but that just stops without printing the error. Wrapping the whole thing in an eval doesn't do much either. So, the question is:

Has anyone found a way to just skip this chunk of XML and continue happily along?

Regards,
Toby

Re: XML::Twig Error Recovery
by Joost (Canon) on Nov 04, 2004 at 17:36 UTC
    The only way to get this kind of data through most XML parsers is to remove or replace the offending character first.

    XML has strict rules about which characters are allowed (independent of character encoding, it seems). Anything that doesn't conform is not considered to be XML, and parsers should fail when they encounter it. The theory is that it's better to croak on parsing than to let through a potentially corrupt file. Besides, the parser has no way of knowing what the intended character should be anyway.
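
    A minimal sketch of the "remove it first" approach, assuming the input is UTF-8 or a single-byte encoding (so the forbidden control bytes never occur inside a multi-byte character) and that the file contains line breaks, so it can be filtered line by line without loading 500MB into memory. The names strip_invalid_xml_chars and clean_file are made up for illustration; they are not part of XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the control characters that XML 1.0 forbids,
# i.e. everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}

# Hypothetical helper: copy $in_path to $out_path one line at a time,
# cleaning each line on the way through.
sub clean_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} strip_invalid_xml_chars($line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

    The cleaned copy can then be handed to $twig->parsefile as usual. Note that this drops the bad character rather than skipping the whole element, so the keywords line from the question would survive with the ^\ removed.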

    If you want to push binary data through, maybe using a &#number; code will help. I haven't tried it.