in reply to XML::Parser and &entity;

Your data is still not well-formed XML. The only pre-defined entities in XML are &lt;, &amp;, &gt;, &quot; and &apos;. Numerical entities, &#nb; or &#xnb;, are also allowed. Everything else needs to be declared.

So you have several options here:

You could also internally use the entity declaration files and pre-process the XML to convert them to numerical entities. I can't think of an easy way to do this right now (except using xmlwf -p -d result_dir file.xml but then the output is in utf-8) but I'll have a look at it.
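For what it's worth, here is a minimal sketch of that pre-processing step in plain Perl. The `%ENTITY_CODEPOINT` table and the `named_to_numeric` name are made up for illustration; a real table would have to be built from the entity declaration files. It is a naive regex pass, so it will also rewrite anything that looks like an entity inside comments or CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical lookup table: entity name => Unicode code point.
# A real table would be generated from the entity declaration files.
my %ENTITY_CODEPOINT = (
    eacute => 0xE9,
    nbsp   => 0xA0,
    copy   => 0xA9,
);

# The five entities XML predefines; these are left alone.
my %PREDEFINED = map { $_ => 1 } qw(lt gt amp quot apos);

# Replace each known named entity with a decimal character reference.
# Predefined entities and unknown names pass through unchanged.
sub named_to_numeric {
    my ($xml) = @_;
    $xml =~ s{&([A-Za-z][A-Za-z0-9._-]*);}{
        my $name = $1;
        (!$PREDEFINED{$name} && exists $ENTITY_CODEPOINT{$name})
            ? '&#' . $ENTITY_CODEPOINT{$name} . ';'
            : "&$name;"
    }gex;
    return $xml;
}

print named_to_numeric('<p>caf&eacute; &amp; th&eacute;&nbsp;&copy;</p>'), "\n";
```

Names missing from the table are left as they are, so the parser will still complain about them, which is probably what you want.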

Update: fixed typo in doctype

Re: Re: XML::Parser and &entity;
by dingus (Friar) on Nov 26, 2002 at 18:00 UTC
    Just to get this 100% clear in my not very XMLized head.

    A valid XML file with no includes/inline entity definitions may contain:

    • Valid unicode utf-8 characters
    • &lt;, &amp;, &gt;, &quot;, &apos;
    • Numerical entities, &#nb; or &#xnb;
    and nothing else?

    The good news is that I do in fact control the source data file, so I can do further munging. It looks like the best option is to convert the file to UTF-8, including the entities that are not defined. Then, since the characters are in fact all valid Latin-1, I can use my favourite pack/unpack trick to convert the UTF-8 back to Latin-1 for display:

    sub utf8toNative { my $u = $_[0]; my $c = pack("C*",unpack("U*",$u)); return ((length($c)==length($u))?$u:$c); }
    (You have to return the string unchanged if the lengths are the same, as the new string may be incorrect in such cases.)
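    As an alternative to the pack/unpack trick (not the code above, but a swapped-in technique), here is a small sketch using the Encode module that ships with Perl 5.8. The `utf8_to_latin1` name is made up for illustration, and it assumes the input is a string of UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a string of UTF-8 bytes to ISO-8859-1 (Latin-1) bytes.
# With no CHECK argument, Encode substitutes a replacement character
# for anything it cannot map instead of dying.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode('ISO-8859-1', $chars);        # characters -> Latin-1 bytes
}

my $utf8   = "caf\xC3\xA9";            # "cafe" with e-acute, as UTF-8 bytes
my $latin1 = utf8_to_latin1($utf8);    # same text, one byte per character
printf "%d bytes in, %d bytes out\n", length($utf8), length($latin1);
```

    Pure ASCII input comes back unchanged, so there should be no need for the length comparison here.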

    Dingus


    Enter any 47-digit prime number to continue.

      utf-8-ing everything will indeed save you some headache. That will be playing along with the "XML Way", instead of fighting it. Just to be complete though: you can use another encoding if you specify it in the XML declaration (<?xml version="1.0" encoding="ISO-8859-1"?>). XML::Parser based modules will nevertheless convert the input to utf-8 before passing it to your code.

Re: Re: XML::Parser and &entity;
by mirod (Canon) on Nov 26, 2002 at 17:46 UTC

    The easiest way I have found to resolve the entities, replacing them by the numerical entity, and to drop the now useless DTD is to use xmllint, from libxml2:

    xmllint --noent --dropdtd file.xml