in reply to Search and replacing across 500,000 HTML documents

Solution #1: HTML::TreeBuilder

use strict;
use warnings;
use HTML::TreeBuilder;

my $path = '/var/www/html/tabulation.html';
my $URL  = '<a href="http://www.surveycomplete.com/articles/">' . 'Mike Judge' . '</a>';

my $tree = HTML::TreeBuilder->new_from_file($path) or die "Can't open: $!";
$tree->elementify;

foreach my $child ($tree->descendents()) {
    # Node doesn't come from a bad section of the tree, right?
    # (e.g. head section: titles, already in a href link)
    unless (grep { $_ =~ /(head|href)/i }
            ($child->lineage_tag_names, $child->all_external_attr_names)) {
        my @children = $child->content_list;
        # Text children are the entries of the content list that are not references.
        my @text_indices = grep { !ref $children[$_] } 0 .. $#children;
        foreach my $index (@text_indices) {
            my $content = $children[$index];
            if ($content =~ /Mike Judge/i) {
                $content =~ s/Mike Judge/$URL/ig;
                # A '~literal' pseudo-element is printed as-is by as_HTML.
                my $literal = HTML::Element->new('~literal', 'text' => $content);
                $child->splice_content($index, 1, $literal);
            }
        }
    }
}

print $tree->as_HTML;
$tree->delete;

My impressions of HTML::TreeBuilder aren't good. To get started using the module, I first had to read (and understand) all 72 functions of the HTML::Element module. That's bad. I think the documentation needs to be organized into better categories.

My code works, but I'm concerned about it.

1. It's ugly. There's too much manipulating of index numbers and gratuitous use of grep for me to feel comfortable. There might be a shortcut in HTML::Element for detecting whether the current node sits inside a 'head' section or an 'a href' link, but I'm not eager to read about all 72 subs again.

2. HTML::Element stores text content as plain strings in a node's list of children, rather than as nodes in their own right. So to find the content, I had to step through every descendant of the root node, walk its children, and check whether each child was a reference or not. If it wasn't a reference to another node, then I'm to assume that it's text content (so says the manual). This feels crazy to me.
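For what it's worth, both concerns can probably be eased with methods that HTML::Element already provides. The following is only a rough sketch, not a tested replacement for the code above: it assumes objectify_text (which turns text strings into '~text' pseudo-elements), look_up (which walks from a node up through its ancestors), and replace_with. The sample HTML string and the variable names are made up for illustration. Note that it swaps in a different technique from the original: it builds real 'a' elements instead of a '~literal' node, and it skips text by tag name ('head' or 'a') rather than by looking for an 'href' attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, for illustration only.
my $html = '<html><head><title>Mike Judge</title></head>'
         . '<body><p>By Mike Judge. <a href="/x">Mike Judge</a></p></body></html>';
my $href = 'http://www.surveycomplete.com/articles/';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Turn every text string into a '~text' pseudo-element so that text
# can be found with look_down like any other node.
$tree->objectify_text;

foreach my $text ($tree->look_down(_tag => '~text')) {
    # Skip text that lives inside <head> or inside an existing link.
    next if $text->look_up(sub { $_[0]->tag =~ /^(?:head|a)$/ });

    my $content = $text->attr('text');
    next unless $content =~ /Mike Judge/i;

    # Split around the name, keeping the matches, and wrap each match
    # in a newly built <a> element; other pieces stay plain text.
    my @nodes = map {
        if (/^Mike Judge$/i) {
            my $anchor = HTML::Element->new('a', href => $href);
            $anchor->push_content('Mike Judge');
            $anchor;
        }
        else {
            $_;
        }
    } grep { length } split /(Mike Judge)/i, $content;

    # Swap the '~text' node for the new pieces and discard it.
    $text->replace_with(@nodes)->delete;
}

# Convert the remaining '~text' nodes back to plain strings before output.
$tree->deobjectify_text;
my $out = $tree->as_HTML;
print "$out\n";
$tree->delete;
```

With this layout there are no index numbers to track, and the "is this inside a bad section?" test is a single look_up call. Whether it is fast enough for 500,000 documents is a separate question that this sketch does not address.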

HTML::TreeBuilder makes hard things possible, but it's too difficult to write easy-to-understand code with it, and that's really important to me.

On to solution #2: HTML::Parser