(jeffa) Re: Regexp to ignore HTML tags

Here is a very brute-force method that replaces every 'foo' found in a text item with 'bar' using ~~HTML::Template~~ HTML::Parser. The idea is to set a callback that fires every time:

a start tag is encountered
the inside text element is encountered
the end tag is encountered

For the first and third cases, we simply regurgitate what we found to STDOUT. For the middle case, we substitute. Hope this helps. :)

use strict;
use warnings;
use HTML::Parser;

my $parser = HTML::Parser->new(api_version => 3);

$parser->handler(start => \&start, 'self,tagname,attr');
$parser->handler(text  => \&text,  'self,dtext');
$parser->handler(end   => \&end,   'self,tagname');

$parser->parse(q|<foo bar="qux" baz="foo">foo</foo>|);
$parser->eof;   # signal end of document so any buffered text is flushed

sub start {
   my ($parser,$tag,$attr) = @_;
   print "<$tag";
   # we lose the original order of attribs, but we'll live ;)
   print qq| $_="$attr->{$_}"| for keys %$attr;
   print ">";
}

sub text {
   my ($parser,$text) = @_;
   # dtext hands us the text with entities already decoded
   $text =~ s/foo/bar/g;
   print $text;
}

sub end {
   my ($parser,$tag) = @_;
   print "</$tag>";
}
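
If losing the attribute order bothers you, here is a rough sketch of a variant (not what I posted originally, just one way it could go). It assumes you only want to touch text items, so everything else is copied through verbatim with a default handler and the 'text' argspec, which passes along the original source of each event. The $out buffer is only there so the result can be inspected; printing directly works just as well.

```perl
use strict;
use warnings;
use HTML::Parser;

my $out = '';   # collected output (illustrative; printing directly is fine too)

my $parser = HTML::Parser->new(
    api_version => 3,
    # Events with no handler of their own (tags, comments, declarations)
    # are copied through exactly as they appeared in the source.
    default_h => [ sub { $out .= shift }, 'text' ],
    # Text items: substitute on the original (undecoded) text, then copy.
    text_h => [
        sub {
            my $text = shift;
            $text =~ s/foo/bar/g;
            $out .= $text;
        },
        'text',
    ],
);

$parser->parse(q|<foo bar="qux" baz="foo">foo</foo>|);
$parser->eof;

print $out, "\n";
```

With the same input as above, the start tag should come out untouched, attributes in their original order, and only the inner 'foo' replaced. One caveat: if you feed the parser in several chunks, a text item may arrive in pieces, so a match that straddles two pieces would be missed unless you turn on unbroken_text.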

UPDATE:
Thanks hiseldl ... maybe it really is time for me to switch from HTML::Template to TT! :D

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: (jeffa) Re: Regexp to ignore HTML tags by hiseldl (Priest) on Mar 31, 2003 at 16:47 UTC
jeffa, you wrote a good example. A side note: your notes say HTML::Template, whereas you probably meant to say HTML::Parser. Cheers!

-- hiseldl
What time is it? It's Camel Time!