in reply to XML? characters
Now, most likely, your data is not in the Unicode/Latin-1 encoding, but in the Windows character set, also known as Code Page 1252 (CP1252). This character set agrees with ISO-Latin-1 and Unicode for characters with ordinal <= 255, except that it has a few extra printable characters in the range 128 .. 159, where Latin-1 and Unicode have only control characters; see the link I pointed to above for the complete list.
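To make the difference concrete, here is a small sketch (in Python, purely because its standard codecs make this self-contained; the sample bytes are my own illustration, not taken from your data). It decodes the same four bytes once as CP1252 and once as Latin-1 and prints the resulting code points:

```python
# Bytes in the range 0x80..0x9F are where CP1252 and ISO-8859-1 disagree.
# Illustrative sample: euro sign, left/right curly quotes, e-acute (as CP1252).
sample = bytes([0x80, 0x93, 0x94, 0xE9])

as_cp1252 = sample.decode("cp1252")   # Windows interpretation
as_latin1 = sample.decode("latin-1")  # ISO-8859-1 interpretation

for byte, win, iso in zip(sample, as_cp1252, as_latin1):
    print(f"0x{byte:02X}: cp1252 U+{ord(win):04X}, latin-1 U+{ord(iso):04X}")
```

Only the last byte (0xE9, outside 128 .. 159) comes out the same under both interpretations.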
Now the best thing to do is either to replace these entities with their proper numerical values in Unicode, or to turn them into unencoded, raw characters and explicitly declare that this XML file uses the Windows character set. You can do the latter by telling XML::Parser to treat the text as Windows text (see "ProtocolEncoding" in the XML::Parser docs), or by adding the proper declaration line at the top of the XML file, something like this:
<?xml version="1.0" encoding='XXX'?>
You might need an extra encoding file for XML::Parser for this to work; it's been too long since I last checked.
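The first option, replacing the entities with their proper Unicode values, might look roughly like the sketch below. It assumes the entities are decimal numeric references such as `&#147;`; hexadecimal references (`&#x93;`) are not handled. The function name `fix_cp1252_refs` is made up for this example, and it is written in Python only to keep it self-contained; the same lookup-and-substitute idea carries over to Perl.

```python
import re

def fix_cp1252_refs(xml_text: str) -> str:
    """Remap decimal references 128..159 from CP1252 to Unicode code points.

    Hypothetical helper: assumes decimal references only (no &#x..; forms).
    """
    def remap(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)  # outside the problem range: leave untouched
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # position unassigned in CP1252: leave untouched
        return f"&#{ord(char)};"

    return re.sub(r"&#(\d+);", remap, xml_text)

print(fix_cp1252_refs("<q>&#147;hi&#148; &#128;5 &#233;</q>"))
```

References outside 128 .. 159, like `&#233;` here, already mean the same thing in Unicode, so they pass through unchanged.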
You can look at this old mailing list thread, featuring me, for some more info.
p.s. I do have a copy of a file "cp1252.enc", which I generated myself, and which XML::Parser can use so that "CP1252" works as the encoding in the ways described above. You're free to have a copy (it's a small binary file of 2k); there's no need for everyone to go through the hassle of building it themselves, as there's a lot of work involved. I also think I must still have the script I used to generate it lying around, which I might post to this site some time in the near future.