(crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?

I can't speak to whether it's effective or safe, but why build a big old tree when all you're doing is simple filtering? Your memory overhead must already be considerable since you're working within an HTML::Mason framework, so why add to the burden? I would approach this problem with HTML::Parser or HTML::TokeParser instead.

I'll update this node with some code in about 5 minutes (I'm not at my computer).

#!/usr/bin/perl -w
use strict;
#use warnings;

use HTML::TokeParser;

undef $/;
print processHTML(<DATA>);

sub processHTML {
    my $tp = HTML::TokeParser->new(\$_[0]);
    my $return;

    while (my $token = $tp->get_token)
    {
        my $ttype = shift @{ $token };

        if($ttype eq "S")    # start tag?
        {
            $return .= $token->[3];
        }
        elsif($ttype eq "T") # text?
        {
            $token->[0] =~ s{(perl)}{<B>$1</B>}ig;
            $return .= $token->[0];
        }
        elsif($ttype =~ /^(?:C|D)$/) # comment or declaration?
        {
            $return .= $token->[0];
        }
        elsif($ttype =~ /^(?:E|PI)$/) # end tag or processing instruction?
        {
            $return .= $token->[1];
        }
    } # end of while (my $token = $tp->get_token)

    undef $tp;
    return $return;
}

__END__
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>
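
In case the array indexes in that loop look arbitrary: they follow from the token layouts HTML::TokeParser documents for get_token, shifted down by one because the loop shifts the type off first. Here is a rough sketch that just prints each token's type next to its raw source text. The sub name token_summary and the sample input are made up for illustration; they aren't part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input, just enough to show four token types.
my $html = '<!-- note --><p class="x">perl</p>';

# Returns one "TYPE: raw source text" string per token.
sub token_summary {
    my ($doc) = @_;
    my $tp = HTML::TokeParser->new(\$doc) or die "Can't create parser";
    my @lines;
    while (my $token = $tp->get_token) {
        my $type = $token->[0];
        # Where the raw text lives depends on the token type:
        #   ["S", tag, attr, attrseq, text]   => index 4
        #   ["E", tag, text], ["PI", token0, text] => index 2
        #   ["T", text, is_data], ["C", text], ["D", text] => index 1
        my $raw = $type eq 'S'                    ? $token->[4]
                : ($type eq 'E' || $type eq 'PI') ? $token->[2]
                :                                   $token->[1];
        push @lines, "$type: $raw";
    }
    return @lines;
}

print "$_\n" for token_summary($html);
```

Since the loop above does a shift first, S ends up at index 3, E and PI at index 1, and T, C and D at index 0, which is what the script uses.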

update:
After visiting this thread again and looking a little closer at the HTML after __END__, I saw <title>This title contains Perl but does not get changed.</title>. Well, I kind of ignored that part ;), but it's easy to add a sentinel to the loop above: a flag that suppresses the substitution until the closing </title> tag has been seen.

Aww what the heck, here goes, one way to do it with HTML::(Toke)Parser

#!/usr/bin/perl -w
#boldemhtml.pl
use strict;
use warnings;
use HTML::Parser;
use HTML::TokeParser;

# Remember where the data section starts so it can be read twice.
my $data_start = tell DATA;
undef $/;    # slurp mode: <DATA> returns everything in one read
print processHTML(<DATA>);

seek DATA, $data_start, 0;

print 'x' x 30, " HERE GOES a slightly faster version\n";

print processHTML2(<DATA>);

exit;


sub processHTML {
    my $tp = HTML::TokeParser->new(\$_[0]);
    my $return;
    my $SENTINEL=1;

    while (my $token = $tp->get_token)
    {
        my $ttype = shift @{ $token };

        if($ttype eq "S")    # start tag?
        {
            $return .= $token->[3];
        }
        elsif($ttype eq "T") # text?
        {
            $token->[0] =~ s{(perl)}{<B>$1</B>}ig
                unless $SENTINEL;

            $return .= $token->[0];
        }
        elsif($ttype =~ /^(?:C|D)$/) # comment or declaration?
        {
            $return .= $token->[0];
        }
        elsif($ttype =~ /^(?:E|PI)$/) # end tag or processing instruction?
        {
            $SENTINEL = 0 if $token->[0] eq 'title';
            $return .= $token->[1];
        }
    } # end of while (my $token = $tp->get_token)

    undef $tp;
    return $return;
}

sub processHTML2 {
    my $SENTINEL = 1;
    my $p = HTML::Parser->new( api_version => 3);
    my $return;

    $p->handler(default => sub {
                               $return .= $_[0];
                               $SENTINEL = 0 if $_[1] eq 'end' and $_[2] eq '/title';
                               return undef;
                           }
                ,'text,event,tag');

=head1 the default handler could also be rewritten as
    $p->handler(default => sub { $return .= $_[0];
                                 $SENTINEL = 0 if $_[0] =~ m{</title>}i;
                                return undef;
                               }
                       ,'text');
    this version would only have a default handler
=cut



    $p->handler(text => sub {
                            $_[0] =~ s!(perl)!<B>$1</B>!ig
                            unless $SENTINEL;
                            $return .= $_[0];
                            return undef;
                        }
                ,'text');

    $p->parse($_[0]);
    $p->eof;    # signal end of document so any buffered text is flushed
    undef $p;
    return $return;
}


__END__
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>
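
One caveat with the sentinel in both subs above: it starts out set and is cleared at the first </title>, which quietly assumes the title comes before any text you want bolded. Here is a minimal sketch of a more general variant, assuming you'd rather track whether you are currently inside an element you want left alone. The sub name bold_outside, its arguments, and the sample input are all made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: bold $term everywhere except inside the
# elements named in @skip_tags.
sub bold_outside {
    my ($html, $term, @skip_tags) = @_;
    my %skip  = map { lc($_) => 1 } @skip_tags;
    my $depth = 0;     # > 0 while we are inside a skipped element
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $skip{$tag};
            $out .= $text;
        }, 'tagname,text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $depth-- if $skip{$tag} && $depth > 0;
            $out .= $text;
        }, 'tagname,text' ],
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s{(\Q$term\E)}{<B>$1</B>}gi unless $depth;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations, etc. pass through untouched.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $p->parse($html);
    $p->eof;           # flush any buffered text
    return $out;
}

print bold_outside('<title>Perl</title><p>perl and Perl</p>', 'perl', 'title'), "\n";
```

A counter instead of a plain flag means the same sub can skip several element types at once, for example by passing 'title', 'script', 'style'.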

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"
