As a general rule, don't use regular expressions to parse HTML. You typically want a parser. Here's a short example that will remove all anchor tags (beginning and ending) and also change font sizes (though you should really use CSS) and delete the "alt" attribute of images (which you also shouldn't do, but it's here as an example):

    use HTML::TokeParser::Simple 2.1;

    # $html_file and $new_html_doc are assumed to be set elsewhere
    my $parser = HTML::TokeParser::Simple->new($html_file);
    my $html   = '';

    while (defined(my $token = $parser->get_token)) {
        next if $token->is_tag('a');    # strip anchor tags
        if ($token->is_start_tag('font')) {
            $token->set_attr('size', 7);
        }
        if ($token->is_tag('img')) {
            $token->delete_attr('alt');
        }
        $html .= $token->as_is;
    }

    open my $out, '>', $new_html_doc
        or die "Cannot open ($new_html_doc) for writing: $!";
    print {$out} $html;
    close $out;
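For contrast, here is a minimal sketch of why the regex route tends to go wrong. The input string and the pattern are made up for illustration; the point is only that an attribute value may legally contain a ">", which a pattern like [^>]* cannot know about:

```perl
use strict;
use warnings;

# Hypothetical input: the title attribute contains a literal '>'.
my $html = '<p><a href="/x" title="a > b">link</a> text</p>';

# Naive attempt: treat everything from '<a' up to the next '>' as the tag.
(my $naive = $html) =~ s{</?a\b[^>]*>}{}gi;

# The match stops at the '>' inside the attribute value, so part of the
# opening tag is left behind in the output.
print "$naive\n";    # prints: <p> b">link text</p>
```

A tokenizing parser sees the quoted attribute value as a single unit, so it does not trip over this case.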

As a side note, if you want your HTML "cleaned up" a little bit, prior to the $html .= $token->as_is; line, add:

$token->rewrite_tag;

That will preserve the attribute values and double-quote them, lowercase the tag name and attribute names (as they properly should be), and preserve a trailing forward slash if one is used in a self-closing tag:

    # before
    <img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" />

    # after
    <img src="foo.jpg" height="13" width="14" alt="SOME alt Value" />

This method is automatically called on tags that have attributes added, changed, or deleted.
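If you want to try rewrite_tag on its own, something like the following sketch should do. It assumes HTML::TokeParser::Simple is installed, feeds the parser a string via a scalar reference instead of a file, and uses a sample tag based on the "before" line above. The variable names are mine, and the is_tag guard is only a precaution:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple 2.1;

# Sample markup based on the "before" line above.
my $source = q{<img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" />};

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser::Simple->new(\$source);

my $cleaned = '';
while (defined(my $token = $parser->get_token)) {
    # Only tags have anything to rewrite; text and comments pass through.
    $token->rewrite_tag if $token->is_tag;
    $cleaned .= $token->as_is;
}

print "$cleaned\n";
```

The output should correspond to the "after" line above, with lowercased attribute names and double-quoted values.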

In other words, this is a very common task, and HTML::TokeParser::Simple version 2.1 does all of that for you and then some.

Cheers,
Ovid

In reply to Re: Regexps to change HTML tags/attributes by Ovid
in thread Regexps to change HTML tags/attributes by Tricky
