
I'm sorry... I didn't give complete information in my previous post.

The error message is:
not well-formed at line 18208, column 71, byte 511770:
I can't do anything about the encoding of the file, because it's rather large and it comes from a third party.

I can't paste the characters here, but maybe I can describe them: they look like French characters, with diacritics. There are also some ASCII-like characters, such as arrows and so on; not normal text.
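Since the characters can't be pasted, one way to describe them exactly is to dump the raw bytes around the position the parser reports. The following is a minimal sketch, assuming the "byte 511770" figure is an offset from the start of the file; the sub name `hex_context` and the file name `feed.xml` are made up for illustration.

```perl
use strict;
use warnings;

# Return the bytes around $offset in the file $path as a hex string, e.g. "63 e9 64".
# $radius is how many bytes to show on each side (16 is an arbitrary default).
sub hex_context {
    my ($path, $offset, $radius) = @_;
    $radius = 16 unless defined $radius;

    my $start = $offset - $radius;
    $start = 0 if $start < 0;

    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # raw bytes, no newline or encoding translation
    seek($fh, $start, 0) or die "Cannot seek in $path: $!";

    my $buf = '';
    my $got = read($fh, $buf, $offset - $start + $radius + 1);
    die "Cannot read $path: $!" unless defined $got;
    close($fh);

    return join(' ', map { sprintf('%02x', ord($_)) } split(//, $buf));
}

# Hypothetical call using the offset from the error message:
# print hex_context('feed.xml', 511770), "\n";
```

Bytes of 80 and above in the output would point to an encoding question, while bytes below 20 (other than 09, 0a and 0d) would point to control characters.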

I rather suspect that you're right, and it's an encoding issue. Is there any way to sanitize the text before I pass it to the XML::Parser module? Thanks.

Re: Re: Re: XML file won't parse properly
by gregor42 (Parson) on Apr 12, 2001 at 21:36 UTC

    OK, sorry, we must have cross-posted, because this wasn't listed when I initially replied.

    It does indeed appear that you have a possibly malformed XML file.

    I should point out that those are probably not ASCII characters. Unless the document states an encoding in its XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>, then AFAIK the parser will usually assume UTF-8...
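If the supplier confirms the data is really ISO-8859-1, one low-risk option is to leave the content alone and only add the missing encoding to the declaration before parsing. This is a rough sketch under that assumption; `declare_encoding` is a made-up helper, not part of XML::Parser.

```perl
use strict;
use warnings;

# Add an encoding to the XML declaration if it does not have one.
# If there is no declaration at all, prepend a complete one.
sub declare_encoding {
    my ($xml, $encoding) = @_;

    if ($xml =~ /\A(<\?xml\b[^?]*\?>)/) {
        my $decl = $1;
        return $xml if $decl =~ /\bencoding\s*=/;    # already declared; leave it alone

        # The encoding must come directly after the version attribute.
        $xml =~ s/\A(<\?xml\s+version\s*=\s*(["'])[^"']*\2)/$1 encoding="$encoding"/;
        return $xml;
    }

    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}

# ASSUMPTION: the file is ISO-8859-1; that has to be confirmed with the source.
my $fixed = declare_encoding(qq{<?xml version="1.0"?><doc/>}, 'ISO-8859-1');
```

As far as I know, XML::Parser also accepts a ProtocolEncoding option in its constructor, which may avoid editing the text at all; check the module documentation before relying on it.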

    I would most certainly check with the source of your data, since it's possible the file is corrupt... Also, if this is common, they should be having problems with whoever else they're sending these files to.

    With regard to pre-filtering, you want to be VERY careful. Isolate ONLY those characters that are causing the parser to barf, then 1) try escaping them, 2) try commenting them out, and only if that doesn't work 3) try replacing them with whitespace.
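To make the "escaping" option concrete, here is a hedged sketch that turns every high-bit byte into a numeric character reference. It assumes the input is ISO-8859-1, where each byte value equals its Unicode code point; if the data is in some other encoding, the references will name the wrong characters. The sub name `escape_high_bytes` is invented for this example.

```perl
use strict;
use warnings;

# Replace each byte in the range 0x80-0xFF with a decimal character reference.
# ASSUMPTION: the input is ISO-8859-1, so byte value == Unicode code point.
sub escape_high_bytes {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $escaped = escape_high_bytes("caf\xE9");    # $escaped is 'caf&#233;'
```

The result is plain ASCII, so the parser no longer has to guess an encoding. One caveat: character references are not expanded inside CDATA sections, so text in there would come through as the literal string &#233; rather than the accented character.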

    But since this is seemingly a question of malformedness, none of those approaches is sure to work...



    Wait! This isn't a Parachute, this is a Backpack!

      gregor, I don't think it's a well-formedness problem in the sense of start tags not matching end tags or anything like that...

      In the code that I posted below, I set the error context, so I get the place where the error is supposed to occur, and this is inside a CDATA section, or in the embedded text...

      I also don't think the file is corrupt, because I extract it from a zip, and the unzip gives no errors. Unfortunately, the header only says <xml version="1.0">, with no mention of what encoding it is...

      The only option that I see is to do a search and replace to strip out the control characters before I parse. I really don't see any other choice, but I'm willing to listen to anyone who tells me this is too drastic :)
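For what it's worth, the stripping step could look roughly like the sketch below. It deletes only the control characters that XML 1.0 forbids outright and keeps tab, line feed and carriage return, which are legal. The sub name `strip_illegal_controls` is invented for illustration.

```perl
use strict;
use warnings;

# Delete the C0 control characters that XML 1.0 does not allow in a document.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are legal and kept.
sub strip_illegal_controls {
    my ($text) = @_;
    (my $clean = $text) =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $clean;
}

my $clean = strip_illegal_controls("a\x01b\tc\x1Fd\n");    # $clean is "ab\tcd\n"
```

This leaves bytes of 0x80 and above untouched, so it will not help if the parser is actually choking on the accented characters; those are an encoding matter rather than a control-character one. Since tr returns the number of characters it deleted, logging that count is a cheap way to see how much is being thrown away.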

      Thanks