XML and entities, what am I doing wrong?

kevin_i_orourke has asked for the wisdom of the Perl Monks concerning the following question:

I think I'm suffering from a basic misunderstanding here, so please feel free to tell me I'm stupid so long as you then explain why I'm stupid.

This is a fairly long post as I want to try to explain what I'm trying to do before I get to the question.

I have the beginnings of a website and I want to add more content. However, I want to make adding things easy, so that I don't have to keep cutting and pasting headers and footers whenever I change something.

I thought this would be an ideal job for Perl, so I converted all the files to XHTML using HTML Tidy. I've then tried various modules to process the files.

If you look at the site you can see that some of the places and people mentioned have non-English names, containing characters such as ö, ç and é.

The XML modules I've tried so far mostly just delete these entities. XML::Grove seems to be the best, converting them to ÃX, where X is another odd character.

This is where I get to the point: why are the other modules just deleting the entities? Do I need to keep a local copy of the XHTML DTD for the modules to refer to? Do I need to supply some special options to XML::Parser or XML::Parser::PerlSAX?

If you really want to see some example code, let me know. I have to transfer files between work (here) and home (where I'm playing with XML) on Zip disks.

:-(

--
Kevin O'Rourke

Re: XML and entities, what am I doing wrong? by mirod (Canon) on Jun 08, 2001 at 14:16 UTC
Welcome to the wonderful world of XML!

I can't figure out exactly what your original format is, but I will nevertheless go for the shameless plug:

<shameless_plug>XML::Twig will happily deal with this problem. Get the latest version (3.00) from here and you won't have to bother with entities being dropped.</shameless_plug>

Try playing with this code (with and without the `keep_encoding` option, for example):

```perl
#!/bin/perl -w
use strict;
use XML::Twig;

my $t= XML::Twig->new( keep_encoding => 1);

{ $/= '';                       # paragraph mode: each blank-line-separated chunk is one document
  while( <DATA>)
    { $t->parse( $_);
      $t->print;
      print "\n";
    }
}
__DATA__
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE doc SYSTEM "dummy"[]>
<doc att="valué ">A document with text in latin1: soupçonné d'être</doc>

<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "dummy"[]>
<doc att="valué">A document with text in latin1:soupçonné d'etre</doc>
```
Re: XML and entities, what am I doing wrong? by gildir (Pilgrim) on Jun 08, 2001 at 17:19 UTC
The real problem is Unicode.

Most Perl XML modules are built on top of expat, or on XML::Parser, which is an interface to expat. Expat is an XML parser: it takes your XML (XHTML) document and processes its tags and so on. But as XML is fundamentally based on Unicode, expat will convert all your characters to Unicode. For this conversion to work properly, you should have a valid encoding specified in the XML header: `<?xml version='1.0' encoding='iso-8859-2'?>`

This is the primary reason for the odd characters you encounter. They are the UTF-8 (8-bit Unicode) representation of your non-English characters.

You probably want to avoid this conversion. I had a similar problem maybe a year ago, but found no useful solution. XML::Parser has an `original_string` method which returns character data in the original encoding, but it won't expand entities. And there is no way to get attributes in the original encoding. The best way around this is to use Unicode::Map8 to map all Unicode strings back to their original encoding, but that is terribly slow for frequent use. So I wrote my own poor man's XML parser based on Perl patterns. That is a hack, though, not a solution.

If you plan to use XML, you had better move to Unicode completely.

PS: I wonder how XML::Twig implements its keep_encoding option. By forcing expat to behave reasonably, or by converting back to the original charset?
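To make the "ÃX" effect concrete, here is a minimal sketch of the round trip described above. It assumes a Perl with the core Encode module (5.8 or later, so newer than this thread), and the helper name `hex_bytes` is made up for the illustration. It only mimics what the parser does to a single character; it does not call expat.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show each byte (or character) of a string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

# "é" in ISO-8859-1 is the single byte 0xE9.
my $latin1_bytes = "\xE9";

# On input the parser decodes the declared encoding into characters...
my $chars = decode('ISO-8859-1', $latin1_bytes);

# ...and hands the text back as UTF-8, where "é" takes two bytes.
my $utf8_bytes = encode('UTF-8', $chars);
print "UTF-8 bytes:     ", hex_bytes($utf8_bytes), "\n";   # C3 A9

# Displaying those two bytes as Latin-1 shows "Ã" (C3) followed by "©" (A9).
my $misread = decode('ISO-8859-1', $utf8_bytes);
print "read as Latin-1: ", hex_bytes($misread), "\n";      # C3 A9, now two characters

# Converting the UTF-8 back to the original encoding restores the single byte.
my $round_trip = encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));
print "back to Latin-1: ", hex_bytes($round_trip), "\n";   # E9
```

So the characters are not lost, they are just UTF-8 being viewed as Latin-1. Either declare the page as UTF-8 or convert back on output.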
Re: Re: XML and entities, what am I doing wrong? by mirod (Canon) on Jun 08, 2001 at 17:59 UTC
XML::Twig uses the `original_string` method to keep the characters in the original encoding (but then it works only for 1-byte encodings, as it uses a regexp to parse the start tag string to extract the tag name and the attributes). In order to track the entities (and not expand them) I use a `Default` handler that spots them and stores them as a special element.

The latest (still beta) version also comes with a bunch of filters to convert the UTF-8 back to latin1, to html-style text (using HTML::Entities), to DOM-style ASCII + character entities, or to any other encoding using either Unicode::Map8 or (even better if the `iconv` library is installed on your system) Text::Iconv.

Overall, using the `original_string` method, even though it is frowned upon as not being completely kosher, is the easiest choice if (IF) you are using a 1-byte encoding. Dealing with the various cases of internal and external entities (depending on whether they are defined at the beginning of the document or in a separate file) is way trickier, and entities within attributes are generally a huge pain to deal with using XML::Parser.
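As a rough idea of what an "ASCII + character entities" filter does, here is a small sketch that uses only core Perl (Encode, so 5.8 or later). It is not XML::Twig's own code; the function name `to_char_refs` is hypothetical, and it assumes the input is the UTF-8 byte string that the parser hands back.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical filter: turn UTF-8 bytes into plain ASCII, replacing every
# non-ASCII character with a numeric character reference such as &#233;.
sub to_char_refs {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);               # bytes -> characters
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge; # non-ASCII -> &#N;
    return $text;
}

# "soupçonné d'être" as UTF-8 bytes, the way it comes out of the parser.
my $from_parser = "soup\xC3\xA7onn\xC3\xA9 d'\xC3\xAAtre";
print to_char_refs($from_parser), "\n";   # soup&#231;onn&#233; d'&#234;tre
```

The result is pure ASCII, so it survives whatever encoding the web server or browser assumes, at the cost of being harder to read in the source.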