PerlMonks  

Re: Regexps to change HTML tags/attributes

by Ovid (Cardinal)
on Aug 27, 2003 at 16:00 UTC ( [id://287074]=note )


in reply to Regexps to change HTML tags/attributes

As a general rule, don't use regular expressions to parse HTML. You typically want a parser. Here's a short example that will remove all anchor tags (beginning and ending) and also change font sizes (though you should really use CSS) and delete the "alt" attribute of images (which you also shouldn't do, but it's here as an example):

use HTML::TokeParser::Simple 2.1;

# $html_file (input path) and $new_html_doc (output path) are set elsewhere
my $parser = HTML::TokeParser::Simple->new($html_file);
my $html   = '';

while (defined(my $token = $parser->get_token)) {
    next if $token->is_tag('a');    # strip anchor tags (start and end)
    if ($token->is_start_tag('font')) {
        $token->set_attr('size', 7);
    }
    if ($token->is_start_tag('img')) {
        $token->delete_attr('alt');
    }
    $html .= $token->as_is;
}

open my $out, '>', $new_html_doc
    or die "Cannot open ($new_html_doc) for writing: $!";
print {$out} $html;
close $out;

As a side note, if you want your HTML "cleaned up" a little bit, add this just before the $html .= $token->as_is; line:

$token->rewrite_tag;

That will keep the attribute values but wrap them in double quotes, lowercase the tag name and attribute names (as they properly should be), and preserve a trailing forward slash if the tag is self-closing:

# before
<img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" />
# after
<img src="foo.jpg" height="13" width="14" alt="SOME alt Value" />

This method is automatically called on tags that have attributes added, changed, or deleted.
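If you want to try the cleanup without a file on disk, something like the following should work. It is only a sketch: the sample markup and the variable names are made up for illustration, and it relies on the parser accepting a reference to a string in place of a filename.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample markup; a scalar reference is parsed as the document itself
my $source = q{<p><img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" /></p>};

my $parser  = HTML::TokeParser::Simple->new(\$source);
my $cleaned = '';

while (defined(my $token = $parser->get_token)) {
    # Only start tags carry attributes, so only those are rewritten here
    $token->rewrite_tag if $token->is_start_tag;
    $cleaned .= $token->as_is;
}

print "$cleaned\n";
```

Text, comments and end tags pass through as_is untouched, so only the start tags come out normalized.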

In other words, this is a very common task, and HTML::TokeParser::Simple version 2.1 does all of that for you and then some.
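To make the warning about regular expressions concrete, here is a small illustration in core Perl. The input string is invented for the demo; the point is only that a naive "everything between < and >" pattern misfires as soon as an attribute value contains a ">" character.

```perl
use strict;
use warnings;

# Invented input: the alt text legitimately contains a ">" character
my $html = '<a href="next.html"><img src="arrow.png" alt="next >"></a>';

# Naive tag stripper: treat everything from "<" to the next ">" as one tag
(my $stripped = $html) =~ s/<[^>]*>//g;

# All three tags should be gone, but the img tag was cut short at the ">"
# inside its alt value, so a stray quote and bracket are left behind
print "$stripped\n";    # prints: ">
```

A tokenizing parser reads quoted attribute values as a unit, so it does not trip over input like this.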

Cheers,
Ovid

