Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm parsing some XML files with XML::Twig, and getting the error below on some files, e.g.:

not well-formed (invalid token) at line 26, column 89, byte 4127 at C:/Perl64/lib/XML/Parser.pm line 187

It turns out it's failing on special characters such as the registered or copyright symbols.

I can use regexes to strip those out (after reading the XML into a variable and using parse() instead of parsefile()), but is there a way I can tell Parser.pm to just ignore or accept them? I'm afraid I'd have to catch every such character and might miss a few.

$content =~ s/®//g;

thanks,
Scott

Re: XML Parser error
by moritz (Cardinal) on Aug 26, 2011 at 12:05 UTC

    It's probably just a missing or wrong encoding declaration at the top of the XML file. Find out which character encoding the XML file is stored in, and modify the first line in the XML file to look like

    <?xml version="1.0" encoding="correct_encoding_here"?>
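
    If you'd rather script that change than edit each file by hand, here is a minimal sketch. The helper name set_declared_encoding is made up for illustration, and it assumes the file is in an ASCII-compatible encoding (so not UTF-16) and that you have already worked out the real encoding yourself. It replaces the whole declaration, so a standalone attribute or a version other than 1.0 would be lost.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig or XML::Parser):
# make the XML declaration name the encoding the bytes are really in.
# Assumes an ASCII-compatible encoding, so the declaration can be
# matched as plain bytes.
sub set_declared_encoding {
    my ($xml, $encoding) = @_;
    my $decl = qq{<?xml version="1.0" encoding="$encoding"?>};

    # Replace an existing declaration at the very start of the document,
    # or prepend one if the document has none.
    $xml =~ s/\A<\?xml\b[^>]*\?>/$decl/ or $xml = "$decl\n$xml";
    return $xml;
}
```

    You would then hand the returned string to parse() instead of calling parsefile() on the original file.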

      Thanks for the quick reply -- but I believe the files are correctly encoded. They have this at the top:

      <?xml version="1.0" encoding="UTF-8"?>

      Scott

        but I believe the files are correctly encoded.

        Beliefs are a bad basis for debugging. Open the file in a hex editor (or use hexdump -C) and check whether the bytes corresponding to the ® are C2 AE. Only then do you know that it's not an encoding issue, and only then am I willing to look into other possible causes.
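
        If no hex editor is at hand, the same check can be sketched in Perl with the core Encode module. The helper names first_invalid_utf8_offset and hex_at are made up for illustration. The idea is to decode the raw bytes as strict UTF-8 and see where decoding stops; a ® stored as ISO-8859-1 shows up as a lone AE byte rather than C2 AE.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: byte offset of the first sequence that is not
# valid UTF-8, or undef if the whole string decodes cleanly.
sub first_invalid_utf8_offset {
    my ($bytes) = @_;
    my $rest = $bytes;    # work on a copy: FB_QUIET modifies its argument
    # With FB_QUIET, decode() stops at the first malformed sequence and
    # leaves the unprocessed remainder in $rest.
    decode('UTF-8', $rest, Encode::FB_QUIET);
    return length($rest) ? length($bytes) - length($rest) : undef;
}

# Hypothetical helper: hex dump of $count bytes starting at $offset.
sub hex_at {
    my ($bytes, $offset, $count) = @_;
    return join ' ',
        map { sprintf('%02X', ord($_)) } split //, substr($bytes, $offset, $count);
}

# Optional command-line use: perl check_utf8.pl file.xml
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $off = first_invalid_utf8_offset($bytes);
    print defined $off
        ? "Invalid UTF-8 at byte $off: " . hex_at($bytes, $off, 4) . "\n"
        : "File is valid UTF-8\n";
}
```

        The reported offset can then be compared with the byte position in the parser's error message. The two should land close to each other if the encoding is the culprit, though I wouldn't assume they match exactly.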