Regexp to ignore HTML tags

markhoy has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regexp to ignore HTML tags by davorg (Chancellor) on Mar 31, 2003 at 14:59 UTC
If you're dealing with HTML then you should look at HTML::Parser or one of its subclasses.

Update: Here's one example:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::Parser;

    my $p = HTML::Parser->new(text_h    => [\&text, 'text'],
                              default_h => [\&passthru, 'text']);
    $p->parse_file(\*DATA);

    # Text between tags: substitute, then print.
    sub text {
        $_[0] =~ s/foo/bar/;
        print $_[0];
    }

    # Everything else (tags, comments, declarations): print unchanged.
    sub passthru {
        print $_[0];
    }

    __DATA__
    <html>
    <head><title>foo</title></head>
    <body>
    <h1 class="foo">foo</h1>
    </body>
    </html>

--
<http://www.dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
(jeffa) Re: Regexp to ignore HTML tags by jeffa (Bishop) on Mar 31, 2003 at 15:06 UTC
Here is a very brute force method that replaces all text items 'foo' with 'bar' using ~~HTML::Template~~ HTML::Parser. The idea is to set callbacks for every time:

- a start tag is encountered
- the inside text element is encountered
- the end tag is encountered

For the first and third cases, we simply regurgitate what we found to STDOUT. For the middle case, we substitute. Hope this helps. :)

    use strict;
    use warnings;
    use HTML::Parser;

    my $parser = HTML::Parser->new(api_version => 3);
    $parser->handler(start => \&start, 'self,tagname,attr');
    $parser->handler(text  => \&text,  'self,dtext');
    $parser->handler(end   => \&end,   'self,tagname');

    $parser->parse(q|<foo bar="qux" baz="foo">foo</foo>|);
    $parser->eof;    # flush any text still buffered

    sub start {
        my ($parser,$tag,$attr) = @_;
        print "<$tag";
        # we lose the original order of attribs, but we'll live ;)
        print qq| $_="$attr->{$_}"| for keys %$attr;
        print ">";
    }

    sub text {
        my ($parser,$text) = @_;
        $text =~ s/foo/bar/g;
        print $text;
    }

    sub end {
        my ($parser,$tag) = @_;
        print "</$tag>";
    }

UPDATE: Thanks hiseldl ... maybe it really is time for me to switch from HTML::Template to TT! :D

jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: (jeffa) Re: Regexp to ignore HTML tags by hiseldl (Priest) on Mar 31, 2003 at 16:47 UTC
jeffa, you wrote a good example; a side note -- your notes say HTML::Template whereas you probably meant to say HTML::Parser. Cheers! -- hiseldl What time is it? It's Camel Time!
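As a follow-up to the comment in jeffa's code about losing the original attribute order: one possible way around that is to ask HTML::Parser for the original source text of each non-text event (the `text` argspec) and pass it through untouched, so only the text handler does any substituting. The following is a minimal sketch, not anything posted in the thread; the sub name `replace_in_text` is made up for illustration, and it collects the result in a string instead of printing from the handlers.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: replace 'foo' with 'bar' in text only,
# leaving every tag exactly as it appeared in the input.
sub replace_in_text {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Text between tags: substitute on the raw (undecoded) text.
        text_h    => [ sub { my ($t) = @_; $t =~ s/foo/bar/g; $out .= $t }, 'text' ],
        # Tags, comments, declarations: copy the original source verbatim.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);   # deliver each run of text as a single chunk
    $p->parse($html);
    $p->eof;                # flush anything still buffered

    return $out;
}

print replace_in_text('<foo bar="qux" baz="foo">foo</foo>'), "\n";
# prints: <foo bar="qux" baz="foo">bar</foo>
```

Because the text handler here receives `text` rather than `dtext`, entities such as `&amp;` come through undecoded and are written back unchanged. That also means the substitution will not see a 'foo' that is spelled with entities.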
Re: Regexp to ignore HTML tags by hardburn (Abbot) on Mar 31, 2003 at 15:00 UTC
Regexen are generally considered unsuitable for doing work with HTML. They'll work in simple cases, but fail on the many, many complex ones. You're better off using one of the many HTML parsing modules on CPAN.

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident. -- Schemer

Note: All code is untested, unless otherwise stated
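To make "simple cases" versus "complex cases" concrete, here is a small sketch of the kind of regex-only approach one might be tempted to write. It is an illustration only, not code from this thread, and `naive_replace` is a made-up name. It treats everything from `<` to the next `>` as a tag and substitutes in whatever lies between. That holds up for plain markup but goes wrong as soon as an attribute value contains a `>`, which is legal inside a quoted value.

```perl
use strict;
use warnings;

# Naive approach (hypothetical): anything from '<' to the next '>' is a tag;
# substitute only in the pieces in between.
sub naive_replace {
    my ($html) = @_;
    my @pieces = split /(<[^>]*>)/, $html;
    for my $piece (@pieces) {
        next if $piece =~ /^<[^>]*>\z/;   # looks like a tag: leave it alone
        $piece =~ s/foo/bar/g;
    }
    return join '', @pieces;
}

# Simple case: the class attribute is left alone, the text is changed.
print naive_replace('<h1 class="foo">foo</h1>'), "\n";
# prints: <h1 class="foo">bar</h1>

# Complex case: the '>' inside the title attribute ends the "tag" early,
# so the rest of the attribute is treated as text and gets rewritten.
print naive_replace('<a title="1 > 0 foo">foo</a>'), "\n";
# prints: <a title="1 > 0 bar">bar</a>
```

Comments, CDATA sections and script blocks can trip up the same pattern in similar ways, which is the sort of thing the parsing modules already handle.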
Re: Regexp to ignore HTML tags by Jenda (Abbot) on Mar 31, 2003 at 15:39 UTC
Quite likely the others are right that you should use an existing HTML parser. If for some reason that is overkill for you, or if the string is not really valid HTML, you may try something like this:

    use HTML::Entities;

    # $AllowXHTML is a flag set elsewhere in the calling code.
    sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
            # XHTML: also accept self-closing tags such as <br />
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
    }

(This function escapes all characters special to HTML that are not part of valid HTML tags or entities.)

Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. -- Rick Osborne

Edit by castaway: Closed small tag in signature
Regexp to ignore HTML tags by markhoy (Novice) on Apr 01, 2003 at 14:21 UTC
Thanks All!! Will give all suggestions a try ASAP.