You're better off using one of the HTML parsing modules.
No you're not.
<rant>
This mantra is getting very old. In this particular instance the OP is asking about removing <u> and </u>, not about HTML parsing in the general case. Let's look at the issues in turn:
- <u> may be present in attribute values. That's like <img src="foo.jpg" alt="<u>"> for those following along at home. Yes, this is legal HTML, but every web monkey I have spoken to about this, without exception, has said "I didn't even know you could do that". So in Real Life this hardly ever arises.
- <u> may be present in comments. The comments may be there because part of the page has been temporarily commented out. But guess what? They probably want the replacements made in the comments, anyway!
- And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.
Wanting to strip out underline tags is therefore about as trivial as it gets. Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there. In this case, there are no attribute values to worry about and the element name is only one character long. You probably don't even have to worry about the tag wrapping from one line to the next. It really doesn't get any easier than this.
An entire directory of files can be done with the following one-liner:
perl -i.bak -pe 's/<\/?u>//gi' *.html
Saying that one needs a parser to do this is just spreading FUD and making things seem more complicated than they need to be. Do simple jobs simply, and keep it simple, stupid. Save the parser approach for something hard, like converting a font-marked-up page into CSS.
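For anyone who wants to see what that substitution does before letting it loose on a directory, here is a minimal sketch that applies the same regex to a string in memory. The sample markup and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; not taken from the OP's files.
my $html = '<p>Some <u>underlined</u> and <U>shouted</U> text.</p>';

# Same pattern as the one-liner: optional slash, the letter u, then '>'.
# /g removes every occurrence, /i also catches <U> and </U>.
(my $stripped = $html) =~ s/<\/?u>//gi;

print "$stripped\n";
```

The -i.bak switch in the one-liner does the same thing file by file, keeping the untouched original as a .bak copy.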
</rant>
That's a fair cop. But here is another ... instead of just
showing how to strip <u> tags from an HTML document,
wouldn't (if CSS is a valid solution) it be better to show
how to override <u> tags instead?
u { text-decoration: none; }
I understand your rant, grinder, but if you have been
following Tricky's problems ... you see that he just keeps
adding more to the list of items to be removed from his
HTML. Pretty soon his list of regexes is going to be hairy.
But then again, maybe this is better for newcomers
... you don't really appreciate a good thing until you
really need it. ;)
UPDATE:
From Tricky's home node:
I aim to complete my MSc project by Jan 2004, which involves
parsing HTML pages in a proxy server, stripping tags which
impede Web access for visually impaired people (including
dyslexics and colour blind), and returning to the client.
Yep, sounds like Tricky sure could use
HTML::Sanitizer or HTML::Scrubber ... but
I would be tempted to slay this dragon with
XML::LibXML and XML::LibXSLT myself ...
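For comparison with the regex approach, here is a rough sketch of dropping <u> tags with HTML::Parser, a lower-level CPAN module than the ones named above. It assumes HTML::Parser is installed; the sample markup and the handler layout are illustrative, not anything from Tricky's project.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Hypothetical sample input with a '>' inside a quoted attribute value.
my $html = '<p><u title="> bob">bob</u> and <b>alice</b></p>';

my $out = '';

# Pass every start/end tag through unchanged unless it is a <u> element.
my $keep_unless_u = sub {
    my ($tagname, $text) = @_;
    $out .= $text unless $tagname eq 'u';
};

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ $keep_unless_u, 'tagname, text' ],
    end_h       => [ $keep_unless_u, 'tagname, text' ],
    # Text, comments, declarations and so on are copied verbatim.
    default_h   => [ sub { $out .= shift }, 'text' ],
);

$parser->parse($html);
$parser->eof;

print "$out\n";
```

Adding more tags to the removal list then means growing one lookup rather than one more regex.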
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Using an XML module to parse HTML is an exercise in long, excruciating, ultimately futile pain. There is such a phenomenal amount of HTML that just isn't well-formed that you aren't going to get anywhere useful this way with anything other than HTML you generate yourself.
And as for CDATA sections, most people without an SGML background don't even know what they do, and
don't have them in their HTML as a rule.
You mean people who put SCRIPT or STYLE
elements in their HTML have an SGML background? Wow, SGML
is far more popular than I thought.
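To make the SCRIPT point concrete, here is a small sketch showing the trivial substitution also rewriting a string literal inside a script element. The sample page is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical page: the script builds markup from a string containing <u>.
my $html = <<'HTML';
<script type="text/javascript">
document.write("<u>" + label + "</u>");
</script>
<p><u>real underline</u></p>
HTML

# The regex has no notion of being inside a script element, so it edits
# the JavaScript string literals as well as the real tags.
(my $stripped = $html) =~ s/<\/?u>//gi;

print $stripped;
```

Whether that side effect is wanted depends on the page, which is exactly the kind of assumption that needs stating.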
Plus in absence of evidence to the contrary,
the person is in control of the HTML and has a pretty good idea of what's in there.
He may, but we don't. He didn't tell us what's in the HTML
file. It's easy to just assume things, but I can play that
game as well. Just assume there's no <u> present
and do nothing! Assuming things without stating what you
assume is pointless. Furthermore, the OP asks whether the
trivial regex is the best way, or if there's another way.
Hence my answer starts with "For general HTML files".
Besides, if the OP is really in control of what's in the
HTML files, the best answer is to not put stuff in the
files that you don't want to have there.
Abigail
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<u style="text-decoration: underline overline;" title="> bob">bob</u>
</body>
</html>
Hope this helps.
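A quick sketch of how two regexes fare against the <u> element in the sample document above. The second pattern, which allows attributes, is an assumed "obvious next attempt" and is not from the thread.

```perl
use strict;
use warnings;

# The <u> element from the sample document above.
my $line = '<u style="text-decoration: underline overline;" title="> bob">bob</u>';

# 1. The trivial pattern only matches bare <u> and </u>, so the opening
#    tag with attributes survives and only the closing tag is removed.
(my $trivial = $line) =~ s/<\/?u>//gi;

# 2. Allowing attributes with [^>]* stops at the first '>', which here
#    sits inside the quoted title value, so part of the tag is left behind.
(my $with_attrs = $line) =~ s/<\/?u\b[^>]*>//gi;

print "trivial:    $trivial\n";
print "with attrs: $with_attrs\n";
```

Neither result is the plain "bob" one would want, which is the case the sample was built to show.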
antirice
The first rule of Perl club is - use Perl
The ith rule of Perl club is - follow rule i - 1 for i > 1
But oh, for the sport of it.