Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?

johannz has asked for the wisdom of the Perl Monks concerning the following question:

I am filtering HTML pages and would like to bold every mention of the term 'Perl' in the body. Currently, I am doing this with the HTML::TreeBuilder module, as the following code demonstrates.

My question is: Is this the most effective/safest way to do this?

FYI, this code actually runs in the filter section of an HTML::Mason autohandler. In the real code, I assign the return value back to $_, as Mason requires, so I cannot use print statements or any other direct output from my process. Working on the parsed tree also ensures that I only change instances of perl that appear as text, not those inside the markup of links, images, or other tags.

#! perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::AsSubs;

undef $/;
print processHTML(<DATA>);
exit;

sub processHTML {
    my $tree = HTML::TreeBuilder->new_from_content($_[0]);
    $tree->elementify();
    my @body = $tree->look_down("_tag", "body");
    for my $body (@body) {
        $body->objectify_text();
        my @perl_parts = $body->look_down("_tag", "~text",
                sub {$_[0]->attr('text') =~ /perl/i}
            );

        for my $perl_text ( @perl_parts ) {
            my @items = split(/(perl)/i, $perl_text->attr('text'));
            $_ = (/perl/i ? b($_) : $_) for @items ;
            $perl_text->replace_with(@items)->delete();
        }
        $body->deobjectify_text();
    }
    my $return = $tree->as_HTML;
    $tree->delete;
    $return;
}

__END__
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>
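
The loop in processHTML relies on split returning the captured separator along with the surrounding pieces, so each match of the term becomes its own list item. Here is a minimal, core-only sketch of just that step. The helper name split_keep_term is made up for illustration, and plain string tags stand in for the b() call from HTML::AsSubs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): split a text node
# into pieces, keeping every case-insensitive match of "perl" as its own piece.
sub split_keep_term {
    my ($text) = @_;
    # The capturing group makes split return the separators as well.
    return split /(perl)/i, $text;
}

my @items = split_keep_term("Learning Perl and perl idioms");

# Wrap only the pieces that are exactly the term; leave the rest untouched.
# String tags are an assumption here, standing in for HTML::AsSubs::b().
my @marked = map { /\Aperl\z/i ? "<b>$_</b>" : $_ } @items;

print join('', @marked), "\n";
# prints: Learning <b>Perl</b> and <b>perl</b> idioms
```

Note that when the text begins with the term, split returns an empty leading piece, so the real code may end up inserting an empty text node in that case.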

Replies are listed 'Best First'.

(crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by crazyinsomniac (Prior) on Feb 02, 2002 at 02:13 UTC

I'll update this node with some code in about 5 min (I'm not on my computer).

#!/usr/bin/perl -w
use strict;
#use warnings;

use HTML::TokeParser;

undef $/;
print processHTML(<DATA>);

sub processHTML {
    my $tp = HTML::TokeParser->new(\$_[0]);
    my $return;

    while (my $token = $tp->get_token)
    {
        my $ttype = shift @{ $token };

        if($ttype eq "S")    # start tag?
        {
            $return .= $token->[3];
        }
        elsif($ttype eq "T") # text?
        {
            $token->[0] =~ s/(perl)/\<B\>$1\<\/B\>/ig;
            $return .= $token->[0];
        }
        elsif($ttype eq "C" or $ttype eq "D") # comment or declaration?
        {
            $return .= $token->[0];
        }
        elsif($ttype eq "E" or $ttype eq "PI") # end tag or processing instruction?
        {
            $return .= $token->[1];
        }
    } # end of while (my $token = $tp->get_token)

    undef $tp;
    return $return;
}

__END__
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>

update:

<title>This title contains Perl but does not get changed.</title>

Aww, what the heck, here goes: one way to do it with HTML::(Toke)Parser.

#!/usr/bin/perl -w
#boldemhtml.pl
use strict;
use warnings;
use HTML::Parser;
use HTML::TokeParser;

my $data_start = tell DATA;    # remember where the DATA section begins
undef $/;
print processHTML(<DATA>);

seek DATA, $data_start, 0;     # rewind so DATA can be read a second time

print 'x' x 30, " HERE GOES a slightly faster version \n";

print processHTML2(<DATA>);

exit;


sub processHTML {
    my $tp = HTML::TokeParser->new(\$_[0]);
    my $return;
    my $SENTINEL=1;

    while (my $token = $tp->get_token)
    {
        my $ttype = shift @{ $token };

        if($ttype eq "S")    # start tag?
        {
            $return .= $token->[3];
        }
        elsif($ttype eq "T") # text?
        {
            $token->[0] =~ s/(perl)/\<B\>$1\<\/B\>/ig
            unless $SENTINEL;

            $return .= $token->[0];
        }
        elsif($ttype eq "C" or $ttype eq "D") # comment or declaration?
        {
            $return .= $token->[0];
        }
        elsif($ttype eq "E" or $ttype eq "PI") # end tag or processing instruction?
        {
            # stop skipping text once the title has been closed
            $SENTINEL = 0 if $token->[0] eq 'title';
            $return .= $token->[1];
        }
    } # end of while (my $token = $tp->get_token)

    undef $tp;
    return $return;
}

sub processHTML2 {
    my $SENTINEL = 1;
    my $p = HTML::Parser->new( api_version => 3);
    my $return;

    $p->handler(default => sub {
                               $return .= $_[0];
                               $SENTINEL = 0 if $_[1] eq 'end' and $_[2] eq '/title';
                               return undef;
                           }
                ,'text,event,tag');

=head1 the default handler could also be rewritten as
    $p->handler(default => sub { $return .= $_[0];
                                 $SENTINEL = 0 if $_[0] =~ m{</title>}i;
                                return undef;
                               }
                       ,'text');
    this version would only have a default handler
=cut



    $p->handler(text => sub {
                            $_[0] =~ s!(perl)!<B>$1</B>!ig
                            unless $SENTINEL;
                            $return .= $_[0];
                            return undef;
                        }
                ,'text');

    $p->parse($_[0]);
    $p->eof;    # signal end of input so any buffered text is flushed
    undef $p;
    return $return;
}


__END__
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by gav^ (Curate) on Feb 02, 2002 at 03:06 UTC

I like HTML::TreeBuilder but sometimes it isn't best suited for the job at hand. For a simple solution that uses HTML::Parser see my post in this node: Highlight keywords in CGI search results, unless inside an HTML tag.
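
To give a rough idea of what an HTML::Parser approach can look like, here is a minimal sketch. It is not the code from the linked node. The helper name bold_term_in_body and the choice to track <body> with a flag are assumptions made for illustration. Only text events are rewritten, so tag names and attribute values are copied through untouched.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: wrap every match of $term in <b>...</b>, but only in
# text that appears between <body> and </body>.
sub bold_term_in_body {
    my ($html, $term) = @_;
    my $out     = '';
    my $in_body = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        # Track whether we are inside <body>; copy the tag through unchanged.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $in_body = 1 if $tagname eq 'body';
            $out .= $text;
        }, 'tagname,text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $in_body = 0 if $tagname eq 'body';
            $out .= $text;
        }, 'tagname,text' ],
        # Rewrite text only; is_cdata is true inside <script> and <style>.
        text_h => [ sub {
            my ($text, $is_cdata) = @_;
            $text =~ s{(\Q$term\E)}{<b>$1</b>}gi if $in_body && !$is_cdata;
            $out .= $text;
        }, 'text,is_cdata' ],
        # Comments, declarations, and everything else are copied as-is.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $p->parse($html);
    $p->eof;    # flush any buffered text
    return $out;
}

print bold_term_in_body(
    "<html><head><title>Perl</title></head>"
  . "<body><p>I like perl.</p></body></html>",
    'perl'
), "\n";
```

Tracking <body> directly, instead of waiting for </title>, means a document without a title still gets processed. Like the other versions in this thread, it matches the term inside longer words such as Perlmonks.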

Hope this helps...

gav^


Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by trs80 (Priest) on Feb 02, 2002 at 02:16 UTC

while ( <DATA> ) {
    if (/perl/i) {    # case-insensitive, so lines containing 'Perl' are handled too
        s#(\W|\s+)(perl)(\W|\s+)#$1<b>$2</b>$3#i if !/<title>/;
    }
    print;
}

UPDATE

crazyinsomniac

johannz

perl
