metaperl has asked for the wisdom of the Perl Monks concerning the following question:

After parsing an HTML file with HTML::TreeBuilder one gets back a large nested data structure:

my $tree = HTML::TreeBuilder->new_from_file($filename);

In the interest of efficiency, I would like to develop Perl modules in which the HTML is parsed and already in memory at mod_perl startup time:

package html::page::hello_world;


my $tree = HTML::TreeBuilder->new_from_file('/html/hello_world.html');

sub new {
    $tree
}

1;

This way, the module is loaded at server startup and the constructor call incurs no delay from parsing the HTML file. However, there is one problem: once a caller modifies the returned tree, new() would hand back that same modified tree instead of a tree representing a fresh parse of the HTML file.

I therefore want to clone the tree and return a clone:

package html::page::hello_world;


my $tree  = HTML::TreeBuilder->new_from_file('/html/hello_world.html');
my $clone = $tree->clone;

sub new {
    my $retval = $clone;
    $clone = $tree->clone;
    $retval;
}

1;

But I don't want the overhead of making the new clone in the same process. I want to do something like a fork: return the pre-made clone to the caller immediately, so it doesn't have to wait, and manufacture the replacement clone in a separate thread/process.

Could anyone recommend a strategy/module for doing this?


Re: fast return of HTML::Tree object clones (via threads/forks)? (magic)
by tye (Sage) on Nov 15, 2005 at 21:27 UTC

    Have the mod_perl process create the new clone of the object after the response has been sent. The trick is getting Apache to mark the response as complete before you start this cloning (so that the client doesn't wait around). No, I have no idea how to get mod_perl / Apache to let you do that. You might have to do some research.
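    One way to sketch tye's suggestion: mod_perl 1 exposes register_cleanup(), which queues a callback for Apache's cleanup phase, after the response has been sent to the client. This is an untested sketch, assuming mod_perl 1, the package layout from the original post, and that Apache->request is available at the point new() is called:

    ```perl
    package html::page::hello_world;
    use strict;
    use HTML::TreeBuilder;

    my $tree  = HTML::TreeBuilder->new_from_file('/html/hello_world.html');
    my $clone = $tree->clone;

    sub new {
        my $retval = $clone;
        # Defer the expensive re-clone to Apache's cleanup phase, which
        # runs after the client already has its response.
        Apache->request->register_cleanup( sub { $clone = $tree->clone } );
        return $retval;
    }

    1;
    ```

    Under mod_perl 2 the equivalent hook would be a PerlCleanupHandler; either way, the client never waits on clone().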

    - tye

Re: fast return of HTML::Tree object clones (via threads/forks)?
by dragonchild (Archbishop) on Nov 15, 2005 at 19:10 UTC
    Why are you modifying the tree? That sounds rather suspect to me. Plus, I'll bet that if you benchmark it, the cost of cloning is going to be about 80-90% of the cost of parsing. Try it out. If that's all it is, then I'd suggest not worrying about it.
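    A quick way to test dragonchild's claim is the core Benchmark module. The markup below is a stand-in; point new_from_file at one of your real pages for meaningful numbers:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use HTML::TreeBuilder;

    my $html = '<html><head><title>t</title></head><body><h1>Hello</h1></body></html>';
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Run each for ~2 CPU seconds and print a comparison table.
    # delete() frees each throwaway tree (HTML::Element trees are circular).
    cmpthese( -2, {
        parse => sub { HTML::TreeBuilder->new_from_content($html)->delete },
        clone => sub { $tree->clone->delete },
    } );
    ```

    If clone() really costs 80-90% of a parse, the pre-parse scheme buys very little.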

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Why are you modifying the tree? That sounds rather suspect to me.
      If you weren't modifying the tree, then you would be better off simply serving it via Apache, without loading it into memory via HTML::Tree to begin with :).

      So anyway, I was keeping my discussion tongue-in-cheek because I did not want to open another can of worms. The real scenario is that HTML::Seamstress is a set of convenience functions for dynamic HTML generation via HTML::Tree. Currently I use new_from_file, but I was thinking I could get a speedup by pre-parsing and cloning. You are right, though: I should do some benchmarking on the two approaches before throwing a hissy fit.

      But wait - there is a very good reason to pre-parse anyway: you can parse the HTML once and then throw the file away. The way things are set up where I work, our Apache server talks to our HTML aggregation server via TCP/IP. The Apache server serves static things like images, CSS, etc., but for HTML aggregation of search results it forwards the request to the aggregation server, which makes a bunch of search requests and returns dynamically generated HTML. When we get a package of HTML, GIFs, and CSS from the design department, the first thing I have to do is copy the HTML file to the aggregation server and make a .pm of it so that I can rewrite it dynamically at runtime; the GIFs and CSS stay on the Apache server. So I could save myself something of a step if I could parse the HTML once, serialize it, and then clone it at runtime as necessary.
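      The parse-once/serialize idea could be sketched with the core Storable module. One caveat to verify first: HTML::Element nodes hold circular parent links; Storable does follow cycles, but the round trip should be tested against your real pages before relying on it. The markup here is a stand-in, and in practice you would nstore() the frozen tree to disk once and retrieve() it at runtime:

      ```perl
      use strict;
      use warnings;
      use Storable qw(freeze thaw);
      use HTML::TreeBuilder;

      my $html = '<html><body><h1>Hello</h1></body></html>';   # stand-in markup
      my $tree   = HTML::TreeBuilder->new_from_content($html);
      my $frozen = freeze($tree);    # do this once, e.g. at startup or build time

      # Each thaw() yields an independent deep copy of the tree,
      # so no separate clone() step is needed per request.
      my $copy = thaw($frozen);
      print $copy->find('h1')->as_text, "\n";
      ```

      Storable::dclone($tree) is the same freeze-then-thaw round trip in one call, and may be worth benchmarking against HTML::Element's own clone().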