in reply to Re: Removing underline tags with regexp (is a good idea)
in thread Removing underline tags with regexp

That's a fair cop. But here is another ... instead of just showing how to strip <u> tags from an HTML document, wouldn't it be better (if CSS is a valid solution) to show how to override <u> tags instead?
u { text-decoration: none; }
I understand your rant, grinder, but if you have been following Tricky's problems ... you see that he just keeps adding more to the list of items to be removed from his HTML. Pretty soon his list of regexes is going to be hairy. But then again, maybe this is better for newcomers ... you don't really appreciate a good thing until you really need it. ;)
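As a rough illustration of why a parser scales better than a growing pile of regexes, here is a minimal sketch. It is written in Python with only the standard library, standing in for the Perl modules discussed in this thread, and it is not anyone's actual solution. The names `TagStripper` and `strip_tags` are hypothetical. The point is that removing one more tag means adding one more entry to `drop`, not writing one more regex.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML as parsed, leaving out any tag named in `drop` (hypothetical helper)."""

    def __init__(self, drop):
        # convert_charrefs=False so entities such as &amp; are passed through as written
        super().__init__(convert_charrefs=False)
        self.drop = {name.lower() for name in drop}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.drop:
            # get_starttag_text() returns the start tag exactly as it appeared in the input
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; emit it once, unchanged
        if tag not in self.drop:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so end tags come out lowercased
        if tag not in self.drop:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_tags(html, drop=("u",)):
    """Remove the tags in `drop` but keep the text they wrapped."""
    parser = TagStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_tags(page, drop=("u", "font", "blink"))` would cover three tags in one pass; the tag names there are only examples.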

UPDATE:
From Tricky's home node:

I aim to complete my MSc project by Jan 2004, which involves parsing HTML pages in a proxy server, stripping tags which impede Web access for visually impaired people (including dyslexics and colour blind), and returning to the client.
Yep, sounds like Tricky sure could use HTML::Sanitizer or HTML::Scrubber ... but I would be tempted to slay this dragon with XML::LibXML and XML::LibXSLT myself ...

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: 2Re: Removing underline tags with regexp (is a good idea)
by Elian (Parson) on Sep 02, 2003 at 13:23 UTC
    Using an XML module to parse HTML is an exercise in long, excruciating, ultimately futile pain. There is such a phenomenal amount of HTML that just isn't well-formed that you aren't going to get anywhere useful this way with anything other than HTML you generate yourself.
        That code'll break quite a few web pages, making them render incorrectly (where "incorrectly" means they don't function any more) and in some cases changing the semantics of the page. (Not that you could necessarily intuit the semantics without knowing the various broken ways that each browser interprets the page, but...) There are a depressing number of pages that intentionally break standards because that's the only way to get them to render properly.
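To make the well-formedness point concrete, here is a small sketch in Python (standard library only, chosen for illustration; it says nothing about how any particular Perl module behaves). The `SOUP` sample and the helpers `parses_as_xml`, `TagCollector`, and `html_tags` are made up for this example. A strict XML parser rejects ordinary tag soup outright, while a forgiving HTML parser still reports the tags it finds.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up tag soup: unclosed <p>, void <br>, unquoted attribute value
SOUP = "<html><body><p>First<br><p>Second <font color=red>red</font></body></html>"


def parses_as_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag that a forgiving HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """List the start tags found in the markup, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

Here `parses_as_xml(SOUP)` comes back false, while `html_tags(SOUP)` still walks the whole fragment. Markup you generate yourself, such as `<p>ok<br/></p>`, passes the XML check without trouble.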