Search and replacing across 500,000 HTML documents

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Search and replacing across 500,000 HTML documents
by PodMaster (Abbot) on Apr 22, 2004 at 10:25 UTC

looks like I'm on the right path, they don't have a way to manipulate plain text in an HTML file -- while still preserving the HTML structure...

(crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?

Also, a regex is not completely out of the question, something like ~~*code goes here, working on it*~~

use strict;
use warnings;
my $name = 'PodMaster';
my $url  = 'http://perlmonks.org/?node=PodMaster';
my $html = q~
<html>
<title> PodMaster </title>
<style>
PodMaster { }
</style>
<body>
<h1>PodMaster
</h1>
Hi there PodMaster blah blah blah <b>Pod</b><i>Master</i>
</body>
</html>
~;


print $/, untag_MOD( $html, $name, $url ), $/;

#http://perlmonks.org/?node_id=161281  modified for our purposes
sub untag_MOD {
  local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
# 1 is the entire "tag", add +1 to all numbers in comments 
  s{
  ( # podmaster
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of start tags
          SCRIPT |  #     for which
          APPLET |  #     must be skipped
          OBJECT |  #     all content
          STYLE     #     to correspond
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(5)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(2)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(3)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(4)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\4)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   )
   ([^<]*) # 6, text
   }
   '
       my $ret = $1;
       if( $6 ){
           my $text = $6;
           $text =~ s~\b(\Q$_[1]\E)\b~<a href="$_[2]">$1</a>~g;  # add
+ link
           $ret .= $text;
       }
       $ret;
   'gsxe;
  return $_ ? $_ : "";
}


__END__
[download]

strip HTML tags

<b>Pod</b><i>Master</i>

HTML::TreeBuilder

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

[reply]
[d/l]
[select]

Re: Search and replacing across 500,000 HTML documents
by matija (Priest) on Apr 22, 2004 at 10:24 UTC

As you go through the file with HTML::Parser (or whatever) remember if you're in an area where you want to do the replacement, or not (not in titles, yes in free text, etc) and print it to the new file.
Then your text handler subroutine only needs to check that variable, do the search-and-replace on the text string and print it regardless of if the substitution has been made.

[reply]

Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 22, 2004 at 13:45 UTC

   my $p = HTML::TokeParser->new(\$page_content);
[download]

   my $text = $p->get_text;
   if ($text =~ /$my_name/) {#stuff...}
[download]

[reply]
[d/l]
[select]

Re: Search and replacing across 500,000 HTML documents
by pizza_milkshake (Monk) on Apr 22, 2004 at 20:16 UTC

#!perl -wl
# vim: set ts=8:

# how to replace certain text (your $name) with a link (to $url)
# based on what tags it is in inside the url

# the one problem i see is that if text is already in <a> then we
# embed an <a href></a> into it, the text after the href but
# before the end of the original </a> will lose its properties.
# this can be accounted for, but i'm lazy

use strict;
use HTML::Parser;

# global variables

our ($name, $url, @tagstack, $out) =
        qw{pizza http://www.parseerror.com/};

# sub defs

sub start {
        my ($tag, $attr, $text) = @_;
        $tag .= " href" if defined $attr->{"href"};
        push @tagstack, $tag;
        output($text);
}

sub end {
        my ($tag, $text) = @_;
        shift @tagstack
                while (@tagstack && $tagstack[0] !~ /^$tag$/);
        shift @tagstack if @tagstack;
        output($text);
}

sub text {
        my ($text) = @_;
        if ($text =~ /\b$name\b/ && canreplace() && unlinked()) {
                $text =~ s#\b$name\b#<a href="$url">$name</a>#g;
        }
        output($text);
}

sub output {
        my ($txt) = @_;
        $out .= $txt if defined $txt;
}

# are we inside a tag we don't want to link?
sub canreplace {
        return ! grep {m/^(head|title)$/} @tagstack;
}

# are we inside a link right now?
sub unlinked {
        return ! grep {m/^a href$/} @tagstack;
}

# start code

my $p = HTML::Parser->new(
        "start_h"       => [ \&start, "tagname, attr, text" ]
        ,"end_h"        => [ \&end, "tagname, text" ]
        ,"text_h"       => [ \&text, "text" ]
);

$p->parse(q{
<html>
        <head>
                <title>pizza is delicious!</title>
        </head>
        <body>
                <a href="">pizza already linked</a>
                <b>some more pizza</b>
                <a>pizza in a but not linked</a>
        </body>
</html>
});

print $out;
[download]

perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

[reply]
[d/l]

Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 23, 2004 at 03:10 UTC

Wow. Each of these replies are awesome. Rather than just jump into it and say, "Hey, I'm going to use this one." I'm going to try each of the methods (and code) suggested, learn what I can, and post back my experiences.

[reply]

Re: Search and replacing across 500,000 HTML documents
by Anonymous Monk on Apr 23, 2004 at 19:33 UTC

Solution #1: HTML::TreeBuilder

my $path = '/var/www/html/tabulation.html';

my $URL = '<a href="http://www.surveycomplete.com/articles/">' . 
          'Mike Judge' . 
          '</a>';

my $tree = HTML::TreeBuilder->new_from_file($path) 
          or die "Can't open: $!";

$tree->elementify;

foreach my $child ($tree->descendents()) {
    # Node doesn't come from a bad section of the tree, right?
    # (e.g. head section: titles, already in a href link)

    unless (grep { $_ =~ /(head|href)/i } 
        ($child->lineage_tag_names, 
         $child->all_external_attr_names)) {
    
        my @children = $child->content_list;
        my @text_indices = grep { !ref $children[$_] } 0 .. $#children
+;
    
        foreach my $index (@text_indices) {
            my $content = $children[$index];
        
            if ($content =~ /Mike Judge/i) {
                $content =~ s/Mike Judge/$URL/ig;

                  my $literal = HTML::Element->new('~literal','text' =
+> $content);
                $child->splice_content($index,1,$literal);
            }
        }
    }
}

print $tree->as_HTML;

$tree->delete;
[download]

My impressions of HTML::TreeBuilder aren't good. To get started using the module, I first had to read (and understand) all 72 functions of module HTML::Element. That's bad. I think the documentation needs to be organized into better categories.

My code works, but I'm concerned about it.

1. It's ugly. There's too much manipulating of index numbers and gratuitous use of grep for me to feel comfortable. There might be a shortcut for detecting whether the ancestors of the current node are 'href' or 'head' in HTML::Element, but I'm not eager to read about all 72 subs again.

2. HTML::Element stores 'text content' in HTML as children of a node, rather than as nodes themselves, so to find the content, I had to step through each node descendant of the original root node, and detect if the child was a reference or not. If it wasn't a reference to another node, then I'm to assume that it's text content (so says the manual.) This feels crazy to me.

HTML::TreeBuilder makes hard things possible, but it's too difficult to write easy-to-understand code with and that's really important to me.

Onto solution #2 HTML::Parser

[reply]
[d/l]