johannz has asked for the wisdom of the Perl Monks concerning the following question:

I am filtering HTML pages and would like to bold every mention of the term 'Perl' in the body. Currently, I am doing this with the HTML::TreeBuilder module, as the following code demonstrates.

My question is: Is this the most effective/safest way to do this?

FYI, this code is actually being used in a HTML::Mason autohandler, in the filter section. In the real code, I put the return value back into $_, as required by Mason. Therefore, I can not use print statements or any other direct output from my process. This also ensures that I only change instances of perl that are text, not included in links, images, or other tags.

#! perl use strict; use warnings; use HTML::TreeBuilder; use HTML::AsSubs; undef $/; print processHTML(<DATA>); exit; sub processHTML { my $tree = HTML::TreeBuilder->new_from_content($_[0]); $tree->elementify(); my @body = $tree->look_down("_tag", "body"); for my $body (@body) { $body->objectify_text(); my @perl_parts = $body->look_down("_tag", "~text", sub {$_[0]->attr('text') =~ /perl/i} ); for my $perl_text ( @perl_parts ) { my @items = split(/(perl)/i, $perl_text->attr('text')); $_ = (/perl/i ? b($_) : $_) for @items ; $perl_text->replace_with(@items)->delete(); } $body->deobjectify_text(); } my $return = $tree->as_HTML; $tree->delete; $return; } __END__ <html> <head> <title>This title contains Perl but does not get changed.</title> </head> <body> <p>This is some text containing the term 'perl'.</p> <ol> <li>Unix</li> <li>Perl</li> <li>Linux</li> </ol> <p>Notice how the term perl in the following link doesn't change, but +the text does. <a href="http://www.perlmonks.org">Perlmonks.org</a></p> </body> </html>
  • Comment on Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
  • Download Code

Replies are listed 'Best First'.
(crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by crazyinsomniac (Prior) on Feb 02, 2002 at 02:13 UTC
    I don't know about effective, or safe, but why create a big old tree when all you are doing is simple filtering? Your memory overhead must be great since you are working within a HTML::Mason framework, why add to the burden? I would use HTML::Parser or HTML::TokeParser to approach this problem.

    I'll update this node with some code in about 5 min (i'm not on my computer)

    #!/usr/bin/perl -w use strict; #use warnings; use HTML::TokeParser; undef $/; print processHTML(<DATA>); sub processHTML { my $tp = HTML::TokeParser->new(\$_[0]); my $return; while (my $token = $tp->get_token) { my $ttype = shift @{ $token }; if($ttype eq "S") # start tag? { $return .= $token->[3]; } elsif($ttype eq "T") # text? { $token->[0] =~ s/(perl)/\<B\>$1\<\/B\>/ig; $return .= $token->[0]; } elsif($ttype =~ /(?:C|D)/) # comment?declaration { $return .= $token->[0]; } elsif($ttype =~ /(?:E|PI)/) # end tag?process instrunction { $return .= $token->[1]; } } # endof while (my $token = $p->get_token) undef $tp; return $return; } __END__ <html> <head> <title>This title contains Perl but does not get changed.</title> </head> <body> <p>This is some text containing the term 'perl'.</p> <ol> <li>Unix</li> <li>Perl</li> <li>Linux</li> </ol> <p>Notice how the term perl in the following link doesn't change, but +the text does. <a href="http://www.perlmonks.org">Perlmonks.org</a></p> </body> </html>
    update:
    after visiting this thread again, and looking a little closer at the html after __DATA__ I saw <title>This title contains Perl but does not get changed.</title> Well I kind of ignored that portion ;), but it's easy to include a sentinel in the above loop.

    Aww what the heck, here goes, one way to do it with HTML::(Toke)Parser

    #!/usr/bin/perl -w #boldemhtml.pl use strict; use warnings; use HTML::Parser; use HTML::TokeParser; my ${Where_does_data_end} = tell DATA; undef $/; print processHTML(<DATA>); seek DATA, ${Where_does_data_end}, 0; print 'x' x 30, " HERE GO a little faster version \n"; print processHTML2(<DATA>); exit; sub processHTML { my $tp = HTML::TokeParser->new(\$_[0]); my $return; my $SENTINEL=1; while (my $token = $tp->get_token) { my $ttype = shift @{ $token }; if($ttype eq "S") # start tag? { $return .= $token->[3]; } elsif($ttype eq "T") # text? { $token->[0] =~ s/(perl)/\<B\>$1\<\/B\>/ig unless $SENTINEL; $return .= $token->[0]; } elsif($ttype =~ /(?:C|D)/) # comment?declaration { $return .= $token->[0]; } elsif($ttype =~ /(?:E|PI)/) # end tag?process instrunction { $SENTINEL = 0 if $token->[0] eq 'title'; $return .= $token->[1]; } } # endof while (my $token = $p->get_token) undef $tp; return $return; } sub processHTML2 { my $SENTINEL = 1; my $p = HTML::Parser->new( api_version => 3); my $return; $p->handler(default => sub { $return .= $_[0]; $SENTINEL = 0 if $_[1] eq 'end' and $_[ +2] eq '/title'; return undef; } ,'text,event,tag'); =head1 the default handler could also be rewritten as $p->handler(default => sub { $return .= $_[0]; $SENTINEL = 0 if $_[0] =~ m{</title>} +i; return undef; } ,'text'); this version would only have a default handler =cut $p->handler(text => sub { $_[0] =~ s!(perl)!<B>$1</B>!ig unless $SENTINEL; $return .= $_[0]; return undef; } ,'text'); $p->parse($_[0]); undef $p; return $return; } __END__ <html> <head> <title>This title contains Perl but does not get changed.</title> </head> <body> <p>This is some text containing the term 'perl'.</p> <ol> <li>Unix</li> <li>Perl</li> <li>Linux</li> </ol> <p>Notice how the term perl in the following link doesn't change, but +the text does. <a href="http://www.perlmonks.org">Perlmonks.org</a></p> </body> </html>

     
    ______crazyinsomniac_____________________________
    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by gav^ (Curate) on Feb 02, 2002 at 03:06 UTC
Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?
by trs80 (Priest) on Feb 02, 2002 at 02:16 UTC
    This is an interesting question. My quick and dirty suggestion would that instead of using the module you could simply do something like this:
    while ( <DATA> ) { if (/perl/) { s#(\W|\s+)(perl)(\W|\s+)#$1<b>$2</b>$3#i if !/<title>/; } print; }
    Not knowing the sensitivity of the data we are dealing with I would say this is a simpler way and much less code to maintain. There may also very well be content outside of the body tag that you don't want, in which case the above quick and dirty example starts be take on more code.
    Do you want the Perl in Perlmonks to be bold? If not you you should consider changing the regex in the filter code to something more along the lines of the quick and dirty solution if you feel the HTML::TreeBuilder module works better for your needs.

    UPDATE: I added the quick /perl/ match to speed up operations. I also see why my quick solution is dangerous, because it might modify content with in html tags. I ran the solution posted by crazyinsomniac and johannz, both will match words like 'properly', I am not sure what the actual word is that is being matched, but a safer regex will have to be used most likely.