dobrozam has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'm hoping someone can help. I am working with the XML::Parser::Expat module and I'm having trouble resolving external entities using the ExternEnt handler. My goal is to resolve each entity to its ASCII equivalent, OR strip out the entity if it's not in my entity list. I'm declaring the external entities in my XML files like so:
  <!DOCTYPE name SYSTEM "[some path]/entity.list">
The entity.list has all my entities. One example is:
  <!ENTITY &eacute "e">
When I parse each XML file, the ExternEnt handler is never called. I'm assuming the parser is not recognizing my file of entities. Am I missing something? Do you have any insight as to what I'm doing wrong?

What I did do was take some of the entities in my list and place them directly in my XML file like so:
  <!DOCTYPE name SYSTEM "/[some path]/entity.list" [ <!ENTITY &eacute "e"> ]>
Then I used the Entity handler, and it worked! The entity resolved to the appropriate value. This is not an ideal method for me, though, because I have no control over how the XML files are created.
I hope I have given enough detail and I hope that you are able to help OR point me in the right direction to come up with the answer to my problem. I would greatly appreciate any advice or comments. Thanks!

Adam J. Dobrozsi

Re: XML::Parser::Expat Question
by mirod (Canon) on Aug 18, 2004 at 22:05 UTC

    I believe that XML::Parser normal behavior is to use the internal subset (in the document itself), but to ignore the external subset (anything defined in the DTD).

    XML::LibXML might give you more flexibility. If all you want to do is replace those entities, you might be able to do it by just using the appropriate option with xmllint, which comes with libxml2.

    And of course XML::Twig will let you do this ;--)

    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    XML::Twig->new( expand_external_ents => 1,
                    pretty_print         => 'indented')
             ->parsefile( "test_ext_ent.xml")
             ->print;

    Note that in the current version of XML::Twig (3.15) entities used in attribute values will silently disappear (XML::Parser is not very cooperative there either). This is fixed in the development version that's... on my laptop. Let me know if you need it and I will upload it to the XML::Twig page.

      Interesting that you mentioned entities in attributes silently disappearing... that was going to be my next point. The XML I'm parsing has a lot of information in the attributes that I need, and most of the attributes have entities I need resolved.

      Let me think about what to do. I already have most of the parsing code written and my deadline is coming up soon, so I don't know if I'm going to be able to switch over to XML::Twig in time. I will look over the XML::Twig module and learn more about it. I do have a solution in place that opens the XML file before the Expat parse and resolves the entities beforehand. I don't like to do that, but it solves my problem.
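
      A minimal sketch of that kind of pre-processing pass, under some assumptions: the entity list uses standard declarations of the form <!ENTITY eacute "e">, and the names read_entity_list and resolve_entities are hypothetical helpers made up for illustration, not part of XML::Parser. It uses plain regexes rather than a real DTD parser, so it will also rewrite references inside comments and CDATA sections.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The five entities predefined by XML; these are left for the parser.
my %PREDEFINED = map { $_ => 1 } qw(amp lt gt quot apos);

# Hypothetical helper: collect simple declarations of the form
#   <!ENTITY name "value">
# from the text of an entity list. Parameter entities (<!ENTITY % ...>)
# and SYSTEM/PUBLIC declarations do not match and are skipped.
sub read_entity_list {
    my ($dtd_text) = @_;
    my %entity;
    while ( $dtd_text =~ /<!ENTITY\s+([\w.:-]+)\s+(["'])(.*?)\2\s*>/gs ) {
        $entity{$1} = $3;
    }
    return \%entity;
}

# Hypothetical helper: replace each &name; with its value and drop names
# that are not in the list. Predefined entities are kept as they are, and
# numeric character references such as &#233; never match the pattern.
sub resolve_entities {
    my ( $xml, $entity ) = @_;
    $xml =~ s{&([A-Za-z_:][\w.:-]*);}{
          $PREDEFINED{$1}      ? "&$1;"
        : exists $entity->{$1} ? $entity->{$1}
        :                        ''
    }ge;
    return $xml;
}

# Inline sample data so the sketch runs on its own; in practice the two
# strings would be read from the entity list and the XML file.
my $entities = read_entity_list(q{<!ENTITY eacute "e"> <!ENTITY nbsp " ">});
my $clean    = resolve_entities(
    q{<doc att="caf&eacute; &amp; &unknown;bar">r&eacute;sum&eacute;</doc>},
    $entities,
);
print "$clean\n";    # prints: <doc att="cafe &amp; bar">resume</doc>
```

      Because the substitution runs on the raw text before Expat sees it, entities inside attribute values are handled the same way as entities in element content. A replacement value containing '<' or '&' would have to be escaped first, which this sketch does not do.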

      Thanks for all your help. Greatly appreciated! I've been pondering a solution to this problem within Expat for some time. Since Expat provides an ExternEnt handler, I assumed I was doing something wrong! What would you use the ExternEnt handler for, then?

      Thanks again!!!

      Adam

        If I understand XML::Parser correctly, the ExternEnt handler is used for entities that refer to external files, but I don't think there is any built-in way to get to the DTD, and to the info inside it.

        Actually if I read the code in XML::Twig properly (I wrote it quite a while ago), it just parses the DTD with a dummy document, gets the entity info, and uses it later when parsing the main document. And "I don't like to do that, but it solves my problem" ;--(

        About entities in attributes: the Default handler is properly called when an entity is found in an attribute value, but you can't do much at that point, and by the time the Start handler is called, the entity has disappeared from the attribute value that gets passed to it. This is really annoying, especially as the predefined entities ('&amp;', '&lt;', '&gt;'...) get properly replaced.

        For example this is scary, and shows that there isn't much that can be done that will work in all cases:

        #!/usr/bin/perl -w
        use strict;
        use XML::Parser;

        XML::Parser->new( Handlers => { Start => sub { print "att: '$_[3]'\n" } })
                   ->parse( '<!DOCTYPE doc SYSTEM "dummy"><doc att="an &ent; and an &amp;ent;"/>');
        # prints att: 'an and an &ent;'