Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

On a server, I have 500,000 HTML documents. Anywhere where my name occurs, I'd like to wrap it in a link to my homepage.

On the surface, this seems like a simple search and replace hack job. However, we're dealing with HTML and I can't just: $html =~ s/Mike Judge/\Q$URL\E/i . My name could appear in page titles or comments or even already enclosed within a link, so a simple (or even cleverly crafted) regular expression won't work.

I've spent some time perusing CPAN this evening and while HTML::TokeParser (or HTML::PullParser) looks like I'm on the right path, they don't have a way to manipulate plain text in an HTML file -- while still preserving the HTML structure.

This seems like a common problem that's been solved before. Anyone have any suggestions for search and replacing within half a million HTML files?

(As always, I appreciate the monks' kindness!)

  • Comment on Search and replacing across 500,000 HTML documents

Replies are listed 'Best First'.
Re: Search and replacing across 500,000 HTML documents
by PodMaster (Abbot) on Apr 22, 2004 at 10:25 UTC
    looks like I'm on the right path, they don't have a way to manipulate plain text in an HTML file -- while still preserving the HTML structure...
    I don't know what you've been doing, but you most certainly can. There is an example at (crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?.

    Also, a regex is not completely out of the question, something like *code goes here, working on it*

    use strict; use warnings; my $name = 'PodMaster'; my $url = 'http://perlmonks.org/?node=PodMaster'; my $html = q~ <html> <title> PodMaster </title> <style> PodMaster { } </style> <body> <h1>PodMaster </h1> Hi there PodMaster blah blah blah <b>Pod</b><i>Master</i> </body> </html> ~; print $/, untag_MOD( $html, $name, $url ), $/; #http://perlmonks.org/?node_id=161281 modified for our purposes sub untag_MOD { local $_ = $_[0] || $_; # ALGORITHM: # find < , # comment <!-- ... -->, # or comment <? ... ?> , # or one of the start tags which require correspond # end tag plus all to end tag # or if \s or =" # then skip to next " # else [^>] # > # 1 is the entire "tag", add +1 to all numbers in comments s{ ( # podmaster < # open tag (?: # open group (A) (!--) | # comment (1) or (\?) | # another comment (2) or (?i: # open group (B) for /i ( TITLE | # one of start tags SCRIPT | # for which APPLET | # must be skipped OBJECT | # all content STYLE # to correspond ) # end tag (3) ) | # close group (B), or ([!/A-Za-z]) # one of these chars, remember in (4) ) # close group (A) (?(5) # if previous case is (4) (?: # open group (C) (?! # and next is not : (D) [\s=] # \s or "=" ["`'] # with open quotes ) # close (D) [^>] | # and not close tag or [\s=] # \s or "=" with `[^`]*` | # something in quotes ` or [\s=] # \s or "=" with '[^']*' | # something in quotes ' or [\s=] # \s or "=" with "[^"]*" # something in quotes " )* # repeat (C) 0 or more times | # else (if previous case is not (4)) .*? # minimum of any chars ) # end if previous char is (4) (?(2) # if comment (1) (?<=--) # wait for "--" ) # end if comment (1) (?(3) # if another comment (2) (?<=\?) # wait for "?" ) # end if another comment (2) (?(4) # if one of tags-containers (3) </ # wait for end (?i:\4) # of this tag (?:\s[^>]*)? # skip junk to ">" ) # end if (3) > # tag closed ) ([^<]*) # 6, text } ' my $ret = $1; if( $6 ){ my $text = $6; $text =~ s~\b(\Q$_[1]\E)\b~<a href="$_[2]">$1</a>~g; # add + link $ret .= $text; } $ret; 'gsxe; return $_ ? $_ : ""; } __END__
    Note the caveats in strip HTML tags. Another potential (i wouldn't consider it one) caveat is that both of these don't translate <b>Pod</b><i>Master</i> into a link. If you want to do that you should use HTML::TreeBuilder.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Search and replacing across 500,000 HTML documents
by matija (Priest) on Apr 22, 2004 at 10:24 UTC
    Don't so much search-and-replace as read-possibly change-print.
    • As you go through the file with HTML::Parser (or whatever) remember if you're in an area where you want to do the replacement, or not (not in titles, yes in free text, etc) and print it to the new file.
    • Then your text handler subroutine only needs to check that variable, do the search-and-replace on the text string and print it regardless of if the substitution has been made.
Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 22, 2004 at 13:45 UTC
    If you're using HTML::TokeParser, I fail to see the problem. Once you've gotten your page parsed, e.g.:
    my $p = HTML::TokeParser->new(\$page_content);
    Just check using the get_text method:
    my $text = $p->get_text; if ($text =~ /$my_name/) {#stuff...}
    Inside some while loop... You can use "get_tag" before and make sure to replace only in tags that you know wrap plain text (div, p, etc.).

    Does this help?
Re: Search and replacing across 500,000 HTML documents
by pizza_milkshake (Monk) on Apr 22, 2004 at 20:16 UTC
    #!perl -wl # vim: set ts=8: # how to replace certain text (your $name) with a link (to $url) # based on what tags it is in inside the url # the one problem i see is that if text is already in <a> then we # embed an <a href></a> into it, the text after the href but # before the end of the original </a> will lose its properties. # this can be accounted for, but i'm lazy use strict; use HTML::Parser; # global variables our ($name, $url, @tagstack, $out) = qw{pizza http://www.parseerror.com/}; # sub defs sub start { my ($tag, $attr, $text) = @_; $tag .= " href" if defined $attr->{"href"}; push @tagstack, $tag; output($text); } sub end { my ($tag, $text) = @_; shift @tagstack while (@tagstack && $tagstack[0] !~ /^$tag$/); shift @tagstack if @tagstack; output($text); } sub text { my ($text) = @_; if ($text =~ /\b$name\b/ && canreplace() && unlinked()) { $text =~ s#\b$name\b#<a href="$url">$name</a>#g; } output($text); } sub output { my ($txt) = @_; $out .= $txt if defined $txt; } # are we inside a tag we don't want to link? sub canreplace { return ! grep {m/^(head|title)$/} @tagstack; } # are we inside a link right now? sub unlinked { return ! grep {m/^a href$/} @tagstack; } # start code my $p = HTML::Parser->new( "start_h" => [ \&start, "tagname, attr, text" ] ,"end_h" => [ \&end, "tagname, text" ] ,"text_h" => [ \&text, "text" ] ); $p->parse(q{ <html> <head> <title>pizza is delicious!</title> </head> <body> <a href="">pizza already linked</a> <b>some more pizza</b> <a>pizza in a but not linked</a> </body> </html> }); print $out;
    i've put the code up on http://www.parseerror.com/scripts/replace_name.pl

    perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 23, 2004 at 03:10 UTC
    Wow. Each of these replies are awesome. Rather than just jump into it and say, "Hey, I'm going to use this one." I'm going to try each of the methods (and code) suggested, learn what I can, and post back my experiences.
Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 23, 2004 at 19:33 UTC

    Solution #1: HTML::TreeBuilder

    my $path = '/var/www/html/tabulation.html'; my $URL = '<a href="http://www.surveycomplete.com/articles/">' . 'Mike Judge' . '</a>'; my $tree = HTML::TreeBuilder->new_from_file($path) or die "Can't open: $!"; $tree->elementify; foreach my $child ($tree->descendents()) { # Node doesn't come from a bad section of the tree, right? # (e.g. head section: titles, already in a href link) unless (grep { $_ =~ /(head|href)/i } ($child->lineage_tag_names, $child->all_external_attr_names)) { my @children = $child->content_list; my @text_indices = grep { !ref $children[$_] } 0 .. $#children +; foreach my $index (@text_indices) { my $content = $children[$index]; if ($content =~ /Mike Judge/i) { $content =~ s/Mike Judge/$URL/ig; my $literal = HTML::Element->new('~literal','text' = +> $content); $child->splice_content($index,1,$literal); } } } } print $tree->as_HTML; $tree->delete;

    My impressions of HTML::TreeBuilder aren't good. To get started using the module, I first had to read (and understand) all 72 functions of module HTML::Element. That's bad. I think the documentation needs to be organized into better categories.

    My code works, but I'm concerned about it.

    1. It's ugly. There's too much manipulating of index numbers and gratuitous use of grep for me to feel comfortable. There might be a shortcut for detecting whether the ancestors of the current node are 'href' or 'head' in HTML::Element, but I'm not eager to read about all 72 subs again.

    2. HTML::Element stores 'text content' in HTML as children of a node, rather than as nodes themselves, so to find the content, I had to step through each node descendant of the original root node, and detect if the child was a reference or not. If it wasn't a reference to another node, then I'm to assume that it's text content (so says the manual.) This feels crazy to me.

    HTML::TreeBuilder makes hard things possible, but it's too difficult to write easy-to-understand code with and that's really important to me.

    Onto solution #2 HTML::Parser