artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks
I am trying to parse XML documents. They contain some incorrect characters, and when using XML::Parser I get errors such as:
Character reference & #130; refers to an illegal XML character (\202) Ln: 109, Col: 38

Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xe1) in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.6.1/XML/SAX/PurePerl.pm line 383.
Character reference & #225; refers to an illegal XML character (\341) Ln: 49, Col: 40
If I fix each and every such
& #225;
character by hand, the parser works fine. I am sure there must be some standard method.
(Note: There is an intentional space after the '&' sign, to prevent the interpretation of the characters.)

Thanks,
artist

Update: The header tells me that the XML data is in UTF-8 format.

<?xml version="1.0" encoding="UTF-8"?>

Update2: I used XML::DOM and the problem doesn't exist any more.

Re: XML? characters
by bart (Canon) on Jan 12, 2004 at 20:47 UTC
    The numerical value is the character code in Unicode/ISO 10646. In Unicode, the range 128 .. 159 is reserved for control characters; it mirrors the range 0 .. 31. In practice, XML parsers reject the range 128 .. 159. That is what you are seeing.

    Now, most likely, your data is not in the Unicode/Latin-1 encoding, but in the Windows character set, also known as Code Page 1252. This character set is compatible with ISO-Latin-1 and Unicode for characters with ordinal <= 255, except that it includes a few extra printable characters in the range 128 .. 159 (see the link I pointed to above for the complete list).
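
    To make the difference concrete, here is a minimal sketch that looks up which Unicode code point a CP1252 byte value stands for. It assumes a Perl with the core Encode module (5.8 or later, so not the 5.6.1 shown in the error message above), and the helper name cp1252_to_unicode is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the Unicode code point that a single
# CP1252 byte value (0 .. 255) decodes to.
sub cp1252_to_unicode {
    my ($byte) = @_;
    my $octet = pack 'C', $byte;          # one raw byte
    my $char  = decode('cp1252', $octet); # decode it as Windows-1252
    return ord $char;
}

# 130 and 146 fall in the 128 .. 159 range; 225 is the same in
# CP1252, Latin-1 and Unicode.
for my $byte (130, 146, 225) {
    printf "%d -> U+%04X\n", $byte, cp1252_to_unicode($byte);
}
```

    Values of 160 and above come back unchanged, which is why only references in the range 128 .. 159 need attention.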

    Now the best thing to do is to either replace these entities by their proper numerical value in Unicode, or turn them into unencoded, raw characters and explicitly declare that this XML file uses the Windows character set. You can do that by telling XML::Parser to treat the text as Windows text (see "ProtocolEncoding" in the XML::Parser docs), or you can add the proper declaration line at the top of the XML file, something like this:

    <?xml version="1.0" encoding='XXX'?>

    You might need an extra encoding file for XML::Parser for this to work; it's been too long since I last checked.
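
    The first option, replacing the entities by their proper Unicode value, could be sketched as a filter run over the XML text before parsing. This is only an illustration under some assumptions: it relies on the core Encode module (Perl 5.8 or later), it handles decimal references only (not the hexadecimal &#x...; form), and the function name fix_cp1252_refs is invented here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map one decimal reference value. Values in
# 128 .. 159 are treated as CP1252 bytes; everything else is kept.
sub remap_ref {
    my ($n) = @_;
    return $n unless $n >= 128 && $n <= 159;
    my $octet = pack 'C', $n;
    return ord decode('cp1252', $octet);
}

# Rewrite every decimal character reference in a string of XML.
sub fix_cp1252_refs {
    my ($xml) = @_;
    $xml =~ s/&#(\d+);/'&#' . remap_ref($1) . ';'/ge;
    return $xml;
}

# The references for 130 and 146 are rewritten; 225 is left alone.
print fix_cp1252_refs('<p>&#130;quoted&#146; caf&#225;</p>'), "\n";
```

    With the references rewritten this way, the encoding="UTF-8" declaration can stay as it is, because numeric references always name Unicode code points regardless of the declared encoding.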

    You can look at this old mailing list thread, featuring me, for some more info.

    p.s. I do have a copy of a file "cp1252.enc", which I generated myself and which XML::Parser can use, so that you can give "CP1252" as the encoding in the ways described above. You're free to have a copy (it's a small binary file of 2k); there's no need for everyone to go through all the hassle of building it themselves, as there's a lot of work involved. I also think I must still have the script I used to generate it lying around, which I might post to this site some time in the near future.