Caching Entities with XML::LibXML

ikegami has asked for the wisdom of the Perl Monks concerning the following question:

A scraping tool I wrote downloads XHTML documents and parses them using XML::LibXML. As it turns out, it is hammering www.w3.org, fetching http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd (the XHTML DTD) and the three DTDs it includes for every document I parse.

Can libxml2 be told to cache this? Or better yet, cache the parsed result? Is this what ext_ent_handler catches? Does someone already have an ext_ent_handler written? It seems silly that I have to do any of this.

Update: Well, the following answers my third question affirmatively:

use strict;
use warnings;

use XML::LibXML qw( );

my $xhtml = <<'__EOI__';
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test</title>
</head>
<body>Test</body>
</html>
__EOI__

my $parser = XML::LibXML->new(
    ext_ent_handler => sub {
        use Data::Dumper;
        local $Data::Dumper::Useqq = 1;
        print(Dumper(\@_));
        return "";
    },
);

$parser->parse_string($xhtml);

$VAR1 = [
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
          "-//W3C//DTD XHTML 1.0 Strict//EN"
        ];
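
A minimal sketch of how such a handler could cache what it fetches, so that each system ID is downloaded at most once per process. It assumes the handler is called with the system ID and the public ID (as the dump above shows) and that the string it returns is parsed as the entity's content. `make_caching_ent_handler`, `fetch_url` and `new_caching_parser` are hypothetical names, HTTP::Tiny is just one possible fetcher, and the sketch assumes the parser hands over absolute URLs for the files the DTD includes. It caches the raw text only, not the parsed DTD.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a fetcher so that each system ID is fetched
# at most once for the lifetime of the returned closure.
sub make_caching_ent_handler {
    my ($fetch) = @_;    # code ref: ($system_id, $public_id) -> content string
    my %cache;           # system ID => entity text
    return sub {
        my ($system_id, $public_id) = @_;
        $cache{$system_id} = $fetch->($system_id, $public_id)
            unless exists $cache{$system_id};
        return $cache{$system_id};
    };
}

# Hypothetical fetcher: download a URL and return its body, or die.
sub fetch_url {
    my ($url) = @_;
    require HTTP::Tiny;    # loaded lazily, only when a fetch is needed
    my $res = HTTP::Tiny->new->get($url);
    die "Cannot fetch $url: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

# Hypothetical constructor: a parser whose external entities go through
# the cache. Reuse the same parser object for every document.
sub new_caching_parser {
    require XML::LibXML;
    return XML::LibXML->new(
        ext_ent_handler => make_caching_ent_handler(\&fetch_url),
    );
}
```

With this, `new_caching_parser()->parse_string($xhtml)` should only go to the network the first time each file is needed. Writing the fetched text to files instead of `%cache` would make the cache survive across runs.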


Replies are listed 'Best First'.
Re: Caching Entities with XML::LibXML by grantm (Parson) on Feb 25, 2010 at 03:46 UTC
My installation of XML::LibXML does not behave as you describe and does in fact use a catalog to refer to a local copy of the DTD. The POD for XML::LibXML::Parser discusses a `load_catalog()` method, but I'm not calling it - catalogs seem to get used automatically if they're installed in the default place. In my case I'm using Ubuntu Linux with the packaged version of XML::LibXML ('libxml-libxml-perl') from the Ubuntu repositories. I also have the 'w3c-dtd-xhtml' package installed. My system has a `/etc/xml/catalog` file which I've never edited - the package installers seem to look after that.
Re^2: Caching Entities with XML::LibXML by ikegami (Patriarch) on Feb 25, 2010 at 05:01 UTC
That's good news. Well, not so much for me. My home machine, the one on which the script runs, is a WinXP machine.
Re^3: Caching Entities with XML::LibXML by grantm (Parson) on Feb 25, 2010 at 05:48 UTC
You might find this page useful: http://xmlsoft.org/catalog.html. It looks like the /etc/xml/catalog path is a hardcoded default but you can override it with an environment variable.
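
For a machine without a packaged catalog, a small catalog can be written by hand. The following is a rough sketch that generates one; it assumes a local copy of the XHTML DTD files already exists, and the `file:///C:/dtd/xhtml1/` location and the helper names `xml_attr` and `build_catalog` are made up for illustration. It uses a single `rewriteSystem` entry so that the DTD and the files it includes are all redirected by prefix.

```perl
use strict;
use warnings;

# Escape a string for use inside a double-quoted XML attribute.
sub xml_attr {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Build an OASIS XML catalog that rewrites every system ID starting with
# $remote_prefix so that it starts with $local_prefix instead.
sub build_catalog {
    my ($remote_prefix, $local_prefix) = @_;
    return join '',
        qq{<?xml version="1.0"?>\n},
        qq{<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">\n},
        sprintf(
            qq{  <rewriteSystem systemIdStartString="%s" rewritePrefix="%s"/>\n},
            xml_attr($remote_prefix), xml_attr($local_prefix),
        ),
        qq{</catalog>\n};
}

# The local directory is a hypothetical location; adjust it to wherever
# the DTD and .ent files were saved.
my $catalog = build_catalog(
    'http://www.w3.org/TR/xhtml1/DTD/',
    'file:///C:/dtd/xhtml1/',
);
print $catalog;
```

Saved to a file, the result could then be named in the `XML_CATALOG_FILES` environment variable described on the page above, or passed to the `load_catalog()` method mentioned in the XML::LibXML::Parser POD. Neither route has been tried on WinXP here.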
Re: Caching Entities with XML::LibXML by Your Mother (Archbishop) on Feb 24, 2010 at 22:05 UTC
I think the answer is "sort of." You are doing a `validate()` or `is_valid()` call, right? It is a minor pita but if that's right you can use a system path to a local copy of the same DTD. I have done this but it's been a while and I can't reach my old code tree right now. See XML::LibXML::Dtd for a bit more. HTML::DTD might make it a little less painful, maybe.
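
For the validation case, the idea might look roughly like this. It is a sketch only: a tiny stand-in DTD is written to a temporary file so the example is self-contained, and the public ID and file name are invented. With a real local copy of the XHTML DTD, its path would go where `$dtd_path` is.

```perl
use strict;
use warnings;

use File::Temp  qw( tempdir );
use XML::LibXML qw( );

# Write a stand-in DTD to a temporary directory (hypothetical content).
my $dir      = tempdir( CLEANUP => 1 );
my $dtd_path = "$dir/note.dtd";
open( my $fh, '>', $dtd_path ) or die "Cannot write $dtd_path: $!";
print {$fh} "<!ELEMENT note (#PCDATA)>\n";
close($fh) or die "Cannot close $dtd_path: $!";

# Load the DTD from the local path instead of from a remote URL.
my $dtd = XML::LibXML::Dtd->new( '-//EXAMPLE//DTD Note//EN', $dtd_path );

# Validate a parsed document against the locally loaded DTD.
my $doc = XML::LibXML->new->parse_string('<note>hello</note>');
my $ok  = $doc->is_valid($dtd);
print $ok ? "valid\n" : "invalid\n";
```

This only covers validation after parsing. It does not change where the parser itself looks for the DTD named in a document's DOCTYPE.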
Re^2: Caching Entities with XML::LibXML by ikegami (Patriarch) on Feb 24, 2010 at 22:36 UTC
> I think the answer is "sort of." You are doing a `validate()` or `is_valid()` call, right?

No, just a simple parse. Specifying `validation => 0` doesn't stop the behaviour. The parser needs the DTD to know that `&nbsp;` is character U+00A0.

I don't see how to tell the parser to use a preconstructed XML::LibXML::Dtd object. On the other hand, HTML::DTD does provide a handy source for the DTDs for one's `ext_ent_handler`. ( ... or not. It doesn't provide `xhtml-lat1.ent`, which is required by `xhtml1-strict.dtd`. )

I just noticed something called "XML catalogs" in the Parser documentation. It sounds like a simple solution, and it sounds like it allows reuse of the compiled DTDs.
Re^3: Caching Entities with XML::LibXML by Your Mother (Archbishop) on Feb 24, 2010 at 22:46 UTC
Well, this is very interesting. Please update the OP or thread with your final solution, as it were.
Re^4: Caching Entities with XML::LibXML by ikegami (Patriarch) on Feb 25, 2010 at 02:04 UTC
Re^4: Caching Entities with XML::LibXML by ikegami (Patriarch) on Feb 28, 2010 at 07:02 UTC
Re: Caching Entities with XML::LibXML by Corion (Patriarch) on Feb 24, 2010 at 22:19 UTC
I think to prevent XML::LibXML from loading external entities, you can pass `load_ext_dtd => 0` in the constructor. That will stop you from hammering w3.org, but I think it will also somewhat hamper validation, if that's your aim.
Re^2: Caching Entities with XML::LibXML by ikegami (Patriarch) on Feb 24, 2010 at 22:44 UTC
I'm not doing any validation. Getting the entities is required for parsing. Preventing the fetching of the remote DTD without adding something in its place prevents parsing.

use strict;
use warnings;

use XML::LibXML qw( );

my $parser = XML::LibXML->new(
    load_ext_dtd => 0,
);

my $xhtml = <<'__EOI__';
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test</title>
</head>
<body>&nbsp;</body>
</html>
__EOI__

$parser->parse_string($xhtml);

:8: parser error : Entity 'nbsp' not defined
<body>&nbsp;</body>
           ^