in reply to Re: XML and entities, what am I doing wrong?
XML::Twig uses the original_string method to keep the characters in their original encoding. The catch is that this works only for 1-byte encodings, because it uses a regexp to parse the start tag string and extract the tag name and the attributes. To track the entities (and not expand them) I use a Default handler that spots them and stores them as a special element.
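Here is a minimal sketch of those two ideas using plain XML::Parser. It is not XML::Twig's actual code: the sample document, the variable names and the regexp used to recognise entity references are all made up for illustration. It assumes the NoExpand option together with a Default handler, which is how XML::Parser hands unexpanded references for internal entities to your code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document declaring one internal entity.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE doc [ <!ENTITY ent "expanded text"> ]>
<doc att="v1">before &ent; after</doc>
XML

my (@start_tags, @entities);

my $parser = XML::Parser->new(
    NoExpand => 1,    # with a Default handler set, entity references are not expanded
    Handlers => {
        Start => sub {
            my ($expat) = @_;
            # original_string gives the start tag verbatim, in the document's encoding
            push @start_tags, $expat->original_string;
        },
        Default => sub {
            my ($expat, $string) = @_;
            # The Default handler also receives text, declarations and so on,
            # so keep only the strings that look like an entity reference.
            push @entities, $string if $string =~ /^&[^;\s]+;$/;
        },
    },
);

$parser->parse($xml);

print "start tag: $_\n" for @start_tags;
print "entity:    $_\n" for @entities;
```

In a real module you would store each reference as some kind of node in the tree instead of pushing it onto a list, but the handler logic stays the same.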
The latest (still beta) version also comes with a bunch of filters that convert the UTF-8 back to latin1, to html-style text (using HTML::Entities), to DOM-style ASCII plus character entities, or to any other encoding using either Unicode::Map8 or (even better, if the iconv library is installed on your system) Text::Iconv.
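To give an idea of what such filters do, here is a rough sketch that calls Text::Iconv and HTML::Entities directly. It does not use XML::Twig's own filter interface, and the sample string and variable names are just assumptions for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use Text::Iconv;

# Hypothetical sample: "cafe" with an e-acute, as raw UTF-8 bytes.
my $utf8_bytes = "caf\xc3\xa9";

# UTF-8 back to latin1 through the iconv library.
my $converter = Text::Iconv->new("UTF-8", "ISO-8859-1");
my $latin1    = $converter->convert($utf8_bytes);
die "iconv conversion failed\n" unless defined $latin1;

# latin1 text to html-style text: high-bit characters and
# the markup characters become named or numeric entities.
my $html = encode_entities($latin1);

printf "latin1 length: %d bytes\n", length $latin1;
print  "html-style:    $html\n";
```

Text::Iconv accepts whatever encoding names your local iconv knows about, which is why it is the more flexible choice when the library is available.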
Overall, using the original_string method, even though it is frowned upon as not being completely kosher, is the easiest choice if (IF) you are using a 1-byte encoding. Dealing with the various cases of internal and external entities (depending on whether they are defined at the beginning of the document or in a separate file) is way trickier, and entities within attributes are generally a huge pain to deal with using XML::Parser.