in reply to Regex to encode entities in XML
Just to follow up on this problem:
The problem with data coming from a browser is often that XML::Simple cannot load a file, because XML::Parser normally expects a UTF-8 encoded document and dies when fed latin1 characters or HTML entities.
My quick'n dirty trick in this case is to add the following at the top of the document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE CHATTER SYSTEM "dummy.dtd" []>
The first line is the XML declaration, which in this case includes an encoding declaration that tells XML::Parser to accept latin1 characters. You can also use the ProtocolEncoding option in XML::Parser to get the same result. This takes care of characters above 127 (note that this will not be of much help if lexicon starts posting Japanese characters in Shift-JIS, for example).
The second line takes care of HTML entities. By declaring a fake Document Type Definition (DTD) we tell the parser that entities might be defined in an external file. The file does not even have to exist: XML::Parser will not try to open it by default, but the effect is that it will not complain about undefined entities.
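For what it's worth, here is a minimal sketch of how those two lines could be glued onto a document held in a string. The helper name add_prolog and the sample data are made up for illustration, and it assumes the document does not already carry a DOCTYPE of its own. It only uses core Perl, so it does not touch the parser at all.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): prepend the latin1 XML
# declaration and a dummy DOCTYPE to a document held in a string.
# $root must be the name of the document's root element.
# Assumption: the document has no DOCTYPE of its own.
sub add_prolog {
    my ($xml, $root) = @_;

    # Drop an existing XML declaration so we do not end up with two.
    $xml =~ s/\A\s*<\?xml\s.*?\?>\s*//s;

    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
         . qq{<!DOCTYPE $root SYSTEM "dummy.dtd" []>\n}
         . $xml;
}

# Sample input: a latin1 byte (\xe9) and an HTML entity (&nbsp;).
my $raw = "<CHATTER><message>caf\xe9&nbsp;ol\xe9</message></CHATTER>";
my $doc = add_prolog($raw, 'CHATTER');
print $doc, "\n";
```

The resulting string can then be handed to XML::Simple's XMLin, which accepts a string of XML as well as a file name.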
Of course then XML::Parser will happily convert characters above 127 to UTF-8 and we have to resort to tricks to convert them back to latin1, but at least we have loaded the document and we can work with it.
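One possible way to go back to latin1, assuming a Perl recent enough to ship the core Encode module (5.8 or later), is sketched below. This is just one option rather than the trick alluded to above, and the helper name to_latin1 is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string (such as text handed
# back by the parser) into latin1 bytes. Characters that have no latin1
# equivalent are replaced by an XML numeric character reference instead
# of being silently mangled.
sub to_latin1 {
    my ($text) = @_;
    return encode('ISO-8859-1', $text, Encode::FB_XMLCREF);
}

# "caf" plus e-acute (fits in latin1), then a smiley (does not).
my $bytes = to_latin1("caf\x{e9} \x{263a}");
print length($bytes), " bytes\n";
```

Falling back to character references keeps the output loadable as latin1 XML even if something outside latin1 sneaks in.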