rob50 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am processing a large number of external XML files. Is there a way to make the XML::Twig parser simply ignore entities and parse successfully? I've tried, "my $twig = new XML::Twig(expand_external_ents => -1);" but I get the same error. Thank you! Rob

Re: XML Twig entities
by mirod (Canon) on Apr 09, 2009 at 19:52 UTC

    If your document doesn't include a DTD, which I suspect is the case, try including a "fake" one. Make it start with <!DOCTYPE whatever SYSTEM "dummy.dtd">. That should trick the parser in XML::Parser into believing that the entities are defined, without actually expanding them, which should be what you are looking for.
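
A minimal sketch of the fake-DOCTYPE suggestion, for illustration only. The helper names add_fake_doctype and extract_text, and the element name para, are made up for this example; "whatever" and "dummy.dtd" are the placeholders from the post. It assumes each file fits in memory as a string and that XML::Twig is left at its default of not loading the DTD, so dummy.dtd never has to exist.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): add a placeholder DOCTYPE
# when the document has none, so the parser assumes unknown entities
# might be declared in the (never loaded) external DTD.
sub add_fake_doctype {
    my ($xml) = @_;

    # Leave documents that already carry a DOCTYPE alone.
    return $xml if $xml =~ /<!DOCTYPE\b/;

    my $doctype = qq{<!DOCTYPE whatever SYSTEM "dummy.dtd">\n};

    # An XML declaration must stay first, so insert the DOCTYPE after it.
    return $xml if $xml =~ s/\A(\s*<\?xml\s.*?\?>\s*)/$1$doctype/s;

    # No XML declaration: the DOCTYPE can simply go in front.
    return $doctype . $xml;
}

# Hypothetical helper: collect the text of every <para> element.
sub extract_text {
    my ($xml) = @_;
    my @texts;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Handlers receive the twig and the element that just closed.
            para => sub { my ($t, $elt) = @_; push @texts, $elt->text; },
        },
    );
    $twig->parse( add_fake_doctype($xml) );
    return @texts;
}
```

Calling extract_text($xml_string) on a document with no DOCTYPE should then parse instead of dying on the first unknown entity. Undeclared entities are not expanded, so they are likely to show up in the collected text as literal references such as &eacute; and may need to be stripped before building term vectors. That part is worth checking on a sample file.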

      Thanks a lot! I'll give it a try.
Re: XML Twig entities
by ikegami (Patriarch) on Apr 09, 2009 at 18:54 UTC
    Aside from the contradiction of successfully parsing XML without parsing entities, it sounds like the wrong solution to an unspecified problem. What problem are you actually trying to solve?
      I just need to grab the text from certain elements to make document term vectors for querying. I just need the "words" and an id. The problem is I'm parsing thousands of XML files from various external sources. I don't have entity lists for all of them and I can't predict what entities will appear. And I don't need the entities anyway. Thanks, Rob

        I've tried, "my $twig = new XML::Twig(expand_external_ents => -1);" but I get the same error.

        &#x49; &#x73;&#x65;&#x65;&#x6D; &#x74;&#x6F; &#x68;&#x61;&#x76;&#x65; &#x6D;&#x69;&#x73;&#x73;&#x65;&#x64; &#x74;&#x68;&#x61;&#x74; &#x6F;&#x72;&#x69;&#x67;&#x69;&#x6E;&#x61;&#x6C;&#x6C;&#x79;&#x2E; &#x57;&#x68;&#x61;&#x74; &#x65;&#x72;&#x72;&#x6F;&#x72; &#x69;&#x73; &#x74;&#x68;&#x61;&#x74;&#x3F;

        I just need to grab the text from certain elements [...] And I don't need the entities anyway.

        &#x54;&#x68;&#x65; &#x65;&#x6E;&#x74;&#x69;&#x74;&#x69;&#x65;&#x73; &#x72;&#x65;&#x70;&#x72;&#x65;&#x73;&#x65;&#x6E;&#x74; &#x74;&#x65;&#x78;&#x74;&#x2E;


        [ For the rest of the monks ]

        I've tried, "my $twig = new XML::Twig(expand_external_ents => -1);" but I get the same error.

        I seem to have missed that originally. What error is that?

        I just need to grab the text from certain elements [...] And I don't need the entities anyway.

        The entities represent text.
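
The point that entities represent text can be checked on the encoded reply itself, which is written entirely in numeric character references. The sketch below decodes such references with a regex; decode_char_refs is a made-up name for this illustration, and a real XML parser does this step on its own without any DTD. Named entities like &eacute; are a different matter, since they need a declaration.

```perl
use strict;
use warnings;

# Hypothetical helper: decode hexadecimal (&#x..;) and decimal (&#..;)
# character references in a single pass. Named entities are left as-is.
sub decode_char_refs {
    my ($s) = @_;
    $s =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $s;
}

print decode_char_refs(
    '&#x54;&#x68;&#x65; &#x65;&#x6E;&#x74;&#x69;&#x74;&#x69;&#x65;&#x73; '
  . '&#x72;&#x65;&#x70;&#x72;&#x65;&#x73;&#x65;&#x6E;&#x74; '
  . '&#x74;&#x65;&#x78;&#x74;&#x2E;'
), "\n";
# prints "The entities represent text."
```

Dropping the references instead of decoding them would have left that reply empty, which is the risk in skipping entities when the goal is to collect words.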