http://qs1969.pair.com?node_id=476771

moshkod has asked for the wisdom of the Perl Monks concerning the following question:

I have a CGI program that receives a valid XML document and then parses it using XML::SAX::ParserFactory. The parsing fails with the following error:
[Wed Jul 20 17:23:23 2005] [error] [Wed Jul 20 17:23:23 2005] -e: \n[Wed Jul 20 17:23:23 2005] -e: 500 Can't connect to www.ncbi.nlm.nih.gov:80 (connect: Connection timed out) http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd\n[Wed Jul 20 17:23:23 2005] -e: Handler couldn't resolve external entity at line 2, column 149, byte 171\n[Wed Jul 20 17:23:23 2005] -e: error in processing external entity reference at line 2, column 149, byte 171 at /exlibris/sfx_ver/sfx_version_3/app/perl-5.8.6/lib/site_perl/5.8.6/sun4-solaris/XML/Parser.pm line 187\n
I believe the parsing fails because the parser cannot reach the DTD (the server is behind a firewall), but it is not clear to me why it attempts to connect to the DTD at http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd at all, since XML::SAX::ParserFactory does not do any validation by default and I did not change the parser's default settings in my program.

Any ideas?
Thanks, Dana

Re: problem using XML::SAX::ParserFactory
by arturo (Vicar) on Jul 21, 2005 at 13:17 UTC

    As is pointed out in the XML spec, DTDs aren't just for validation; they also tell you what the default values of attributes are, and they can contain entity declarations. In a way, DTDs contain some of the document's information. So, because these things may be needed to 'read' the document, parsers commonly fetch a declared DTD whether or not they are validating. (Strictly speaking, the spec lets a non-validating parser skip external entities, but many read them anyway, and yours evidently does.)

    If you're sure the document's valid, you could strip out the DOCTYPE declaration before siccing your parser on it.
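A minimal sketch of the stripping approach, in plain Perl with no XML modules. The helper name strip_doctype, the sample document, and the public ID in it are made up for illustration; only the DTD URL comes from the question. Keep in mind that removing the DOCTYPE also discards any entity declarations and attribute defaults the DTD would have supplied, so this is only reasonable if the document does not rely on them.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the DOCTYPE declaration (with or without an
# internal subset in [...]) so the parser never sees the external DTD reference.
sub strip_doctype {
    my ($xml) = @_;
    $xml =~ s/<!DOCTYPE\s+[^>\[]*(?:\[.*?\]\s*)?>\s*//s;
    return $xml;
}

# Made-up sample document; the public ID is a placeholder, not the real one.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//Example//DTD Placeholder//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd">
<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>
XML

# Print the document with the DOCTYPE line removed.
print strip_doctype($xml);
```

The regex is deliberately simple and assumes no '>' or '[' appears inside the quoted identifiers; it is a sketch, not a general DTD-aware filter.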

    Update: an even better solution would be to get a local copy of the DTD and use catalog resolution to find it. A catalog resolver can tell a parser where to look for a DTD with a given public ID, irrespective of the system ID (the URL in the DOCTYPE declaration).

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

      Dana, I found something interesting in the documentation.

      The solution is along the lines of what the other poster suggested.

      I think you can redirect the DTD lookup to a local file by using the resolve_entity method. It says that you "can use this method to redirect external system identifiers to secure and/or local URIs, to look up public identifiers in a catalogue, or to read an entity from a database or other input source (including, for example, a dialog box)."

      From CPAN
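To make the idea concrete, here is a rough sketch of a resolver object. It assumes the Perl SAX 2 convention that resolve_entity receives a hash with PublicId and SystemId and returns an input-source hash; whether a particular SAX driver actually calls the method is something to check in that driver's documentation. The class name LocalDTDResolver, its constructor arguments, and the local path are all hypothetical.

```perl
use strict;
use warnings;

package LocalDTDResolver;

# Hypothetical handler class: maps known identifiers to local DTD copies.
sub new {
    my ($class, %args) = @_;
    my $self = {
        by_public => $args{by_public} || {},   # public ID => local path
        by_system => $args{by_system} || {},   # system ID => local path
    };
    return bless $self, $class;
}

# Look up the public ID first (catalog style), then the system ID.
# Unknown entities are passed through unchanged.
sub resolve_entity {
    my ($self, $entity) = @_;
    my $public = defined $entity->{PublicId} ? $entity->{PublicId} : '';
    my $system = defined $entity->{SystemId} ? $entity->{SystemId} : '';

    my $local = $self->{by_public}{$public} || $self->{by_system}{$system};

    return {
        PublicId => $entity->{PublicId},
        SystemId => defined $local ? $local : $entity->{SystemId},
    };
}

package main;

my $dtd_url = 'http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd';

my $resolver = LocalDTDResolver->new(
    # Placeholder path: point this at wherever the downloaded DTD lives.
    by_system => { $dtd_url => '/path/to/local/pubmed_041101.dtd' },
);

# Call the method directly to show the redirect, without involving a parser.
my $source = $resolver->resolve_entity({ PublicId => undef, SystemId => $dtd_url });
print "$source->{SystemId}\n";
```

The example calls the resolver by hand so it runs without any XML modules installed. Hooking it into a real parser would mean passing the object wherever the chosen driver accepts an entity resolver or handler, which is not shown here.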