ikegami has asked for the wisdom of the Perl Monks concerning the following question:

A scraping tool I wrote downloads XHTML documents and parses them using XML::LibXML. As it turns out, it is hammering www.w3.org: for every document I parse, it fetches http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd (the XHTML DTD) and the three entity files it includes.

Can libxml2 be told to cache this? Or better yet, cache the parsed result? Is this what ext_ent_handler catches? Does someone already have an ext_ent_handler written? It seems silly that I have to do any of this.

See also: W3C Systems Team Blog: W3C's Excessive DTD Traffic.

Update: Well, the following answers my third question affirmatively:

use strict;
use warnings;

use XML::LibXML qw( );

my $xhtml = <<'__EOI__';
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test</title>
</head>
<body>Test</body>
</html>
__EOI__

my $parser = XML::LibXML->new(
    ext_ent_handler => sub {
        use Data::Dumper;
        local $Data::Dumper::Useqq = 1;
        print(Dumper(\@_));
        return "";
    },
);

$parser->parse_string($xhtml);

$VAR1 = [
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
          "-//W3C//DTD XHTML 1.0 Strict//EN"
        ];
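
Building on that, here is a rough sketch of a handler that serves saved copies from a local directory instead of returning an empty string. The dump above shows the handler receiving the system ID first and the public ID second, so the sketch relies on that order. The name make_ent_handler and the directory layout are hypothetical. The sketch matches on the base name of the system ID because I have not checked whether the included .ent files arrive as relative or absolute IDs. It only caches the raw file text per process, not the parsed DTD.

```perl
use strict;
use warnings;

use File::Basename qw( basename );
use File::Spec     qw( );

# Build an ext_ent_handler that serves entities from a local directory.
# $dir is a hypothetical folder holding saved copies of xhtml1-strict.dtd
# and the .ent files it includes, stored under their original file names.
sub make_ent_handler {
    my ($dir) = @_;
    my %cache;    # file name => contents, so each file is read once per process

    return sub {
        my ($sysid, $pubid) = @_;

        # Match on the last path segment so that both
        # "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent" and a bare
        # "xhtml-lat1.ent" map to the same local file.
        my $name = basename($sysid // '');
        return '' if $name eq '';

        if (!exists $cache{$name}) {
            my $path = File::Spec->catfile($dir, $name);
            if (open(my $fh, '<:raw', $path)) {
                local $/;    # slurp the whole file
                $cache{$name} = <$fh>;
            }
            else {
                warn "No local copy of $name ($path): $!\n";
                $cache{$name} = '';
            }
        }

        return $cache{$name};
    };
}
```

It would be hooked up as XML::LibXML->new( ext_ent_handler => make_ent_handler('C:/dtd') ), where C:/dtd is only a placeholder path. Returning an empty string for a missing file mirrors the test handler above, so an unknown entity still fails at parse time rather than triggering a download.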

Re: Caching Entities with XML::LibXML
by grantm (Parson) on Feb 25, 2010 at 03:46 UTC

    My installation of XML::LibXML does not behave as you describe and does in fact use a catalog to refer to a local copy of the DTD. The POD for XML::LibXML::Parser discusses a load_catalog() method, but I'm not calling it - catalogs seem to get used automatically if they're installed in the default place.

    In my case I'm using Ubuntu Linux with the packaged version of XML::LibXML ('libxml-libxml-perl') from the Ubuntu repositories. I also have the 'w3c-dtd-xhtml' package installed. My system has a /etc/xml/catalog file which I've never edited - the package installers seem to look after that.

      That's good news. Well, not so much for me. My home machine, the one on which the script runs, is a WinXP machine.

        You might find this page useful: http://xmlsoft.org/catalog.html. It looks like the /etc/xml/catalog path is a hardcoded default but you can override it with an environment variable.
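
        For a machine without a packaged catalog, something along these lines might work. It is an untested sketch that writes a one-rule catalog redirecting the W3C XHTML 1.0 DTD location to a local prefix. The sub names, the catalog file name and the file:///C:/dtd/ prefix are all made up for illustration; rewriteSystem is an element from the OASIS XML Catalogs format, and I am assuming libxml2 applies it to the system ID in the DOCTYPE.

```perl
use strict;
use warnings;

# Return the text of a minimal XML catalog that redirects the W3C XHTML 1.0
# DTD location to $local_prefix (a URI ending in "/", such as a file: URI).
sub xhtml_catalog {
    my ($local_prefix) = @_;
    my $remote_prefix = 'http://www.w3.org/TR/xhtml1/DTD/';

    return join "\n",
        '<?xml version="1.0"?>',
        '<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">',
        qq{  <rewriteSystem systemIdStartString="$remote_prefix"},
        qq{                 rewritePrefix="$local_prefix"/>},
        '</catalog>',
        '';
}

# Write the catalog to $file and return $file.
sub write_catalog {
    my ($file, $local_prefix) = @_;

    open(my $fh, '>', $file) or die "Can't write $file: $!";
    print {$fh} xhtml_catalog($local_prefix);
    close($fh) or die "Can't close $file: $!";

    return $file;
}
```

        The resulting file could then be passed to the parser's load_catalog() method, or named in the XML_CATALOG_FILES environment variable. The local prefix has to point at a directory that already holds saved copies of xhtml1-strict.dtd and the .ent files it pulls in.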

Re: Caching Entities with XML::LibXML
by Your Mother (Archbishop) on Feb 24, 2010 at 22:05 UTC

    I think the answer is "sort of." You are doing a validate() or is_valid() call, right? It is a minor pita but if that's right you can use a system path to a local copy of the same DTD. I have done this but it's been a while and I can't reach my old code tree right now. See XML::LibXML::Dtd for a bit more. HTML::DTD might make it a little less painful, maybe.

      I think the answer is "sort of." You are doing a validate() or is_valid() call, right?

      No, just a simple parse. Specifying validation => 0 doesn't stop the behaviour. The parser needs the DTD to know that &nbsp; is character U+00A0. I don't see how to tell the parser to use a preconstructed XML::LibXML::Dtd object.

      On the other hand, HTML::DTD does provide a handy source for the DTDs for one's ext_ent_handler. ( ... or not. It doesn't provide xhtml-lat1.ent, which is required by xhtml1-strict.dtd. )

      I just noticed something called "XML catalogs" in the Parser documentation. It sounds like a simple solution, and it sounds like it allows reuse of the compiled DTDs.

        Well, this is very interesting. Please update the OP or thread with your final solution, as it were.

Re: Caching Entities with XML::LibXML
by Corion (Patriarch) on Feb 24, 2010 at 22:19 UTC

    I think to prevent XML::LibXML from loading external entities, you can pass load_ext_dtd => 0 in the constructor. That will stop you from hammering w3.org, but I think it will also somewhat hamper validation, if that's your aim.

      I'm not doing any validation. Getting the entities is required for parsing. Preventing the fetching of the remote DTD without adding something in its place prevents parsing.
      use strict;
      use warnings;

      use XML::LibXML qw( );

      my $parser = XML::LibXML->new(
          load_ext_dtd => 0,
      );

      my $xhtml = <<'__EOI__';
      <?xml version="1.0" ?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <title>Test</title>
      </head>
      <body>&nbsp;</body>
      </html>
      __EOI__

      $parser->parse_string($xhtml);

      :8: parser error : Entity 'nbsp' not defined
      <body>&nbsp;</body>
                  ^