Re: Removing underline tags with regexp
by jeffa (Bishop) on Sep 02, 2003 at 12:53 UTC
See? We keep telling you to use a parser for this ... you
keep refusing and you keep coming back with questions. Next
you'll be wanting to do something to <b> and <i> tags.
:)
In Regexp conundrum you specify that you want to "remove
letter-spacing and word-spacing attributes from my
in-document style sheet." Sounds like you should specify
a second style sheet that looks the way you want
it to. If this is your site ... problem solved, if not,
then you could use regular expressions to change to a
different CSS file. I would use
HTML::TokeParser::Simple. If this is not your site
and the CSS is all inline ... then maybe you could simply
"turn off" CSS. There are so many potential solutions ...
this is why you should be very descriptive in what you are
doing and why you are doing it.
In Re: Problems Trying to Send E-Mail and Insert into Database, you say that "I've managed to extract image
tags and re-write the changes, so I'm heading in the right
direction..." No. You've been going the wrong direction ever
since. :(
In CPAN module download probs..., you do good! You tried to install a CPAN
module, but it didn't work. More than likely, you tried to
install a tar ball (which is for Unix) on a Windows box.
Since you are using Activestate, you should be using
PPM.
barbie gave you a clue about this at Re: CPAN module download probs..., did
you see this after you posted Re: CPAN module download probs...?
Ok, enough tracking. ;) Where am i going with all of this?
It is time for you to learn how to install PPM's. Me? I have
never done it. I use Linux. But, if you invest as much
energy into getting PPM's to work on your computer as you
have trying to take down windmills with a lance (parsing
HTML with regular expressions), you will succeed. And then
you will have a nice set of modules to help you with this
task.
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: Removing underline tags with regexp
by broquaint (Abbot) on Sep 02, 2003 at 10:32 UTC
Or one simple regex replace
s{</?u>}()ig;
See perlre for more info.
HTH
_________ broquaint
|
s!<\s*/?\s*u\s*>!!ig;
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
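To see how the two patterns above differ, here is a small sketch run against a made-up string (the sample input is an assumption, not something from the original question). The strict pattern only touches exact <u> and </u>; the whitespace-tolerant one also catches tags padded with spaces.

```perl
use strict;
use warnings;

# Hypothetical input: one tidy pair of tags and one pair padded with whitespace.
my $html = "<u>tidy</u> and < u >padded< / u >";

# Strict pattern from the first reply: only <u> and </u>, in any case.
(my $strict = $html) =~ s{</?u>}()ig;

# Whitespace-tolerant pattern from the follow-up reply.
(my $loose = $html) =~ s!<\s*/?\s*u\s*>!!ig;

print "strict: $strict\n";
print "loose:  $loose\n";
```

Which one you want depends on how messy the markup really is; neither copes with attributes on the tag.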
Re: Removing underline tags with regexp
by Elian (Parson) on Sep 02, 2003 at 12:51 UTC
The opinions so far are about right--while in the general case you shouldn't, in the normal case you can. However, I'd consider the regex:
s/<(\/?)u>/<!-- $1 u -->/gi;
just so you can reasonably easily find where you nuked things just in case you ended up munching tags you didn't want to. (A good backup of the files wouldn't be out of order either...)
Update: Missing a slash. Fixed.
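As a rough illustration of the comment-marker idea, the sketch below runs that substitution over a made-up snippet (the input string is hypothetical, and the /gi flags are an assumption so that every tag in either case is caught). Each removed tag leaves behind a comment you can grep for later.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $html = '<p>Some <u>underlined</u> and <U>shouted</U> text.</p>';

# Replace each <u> or </u> with an HTML comment recording what was removed.
# $1 holds the optional slash, so closing tags stay distinguishable.
(my $marked = $html) =~ s/<(\/?)u>/<!-- $1 u -->/gi;

# Count the markers so you know how many tags were replaced.
my $count = () = $marked =~ /<!-- \/? u -->/g;

print "$marked\n";
print "replaced $count tags\n";
```

Searching the output for "u -->" then shows every place a tag used to be.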
Re: Removing underline tags with regexp
by Abigail-II (Bishop) on Sep 02, 2003 at 10:43 UTC
For general HTML files, removing tags cannot be approached
with simple regexes. <u> may be present in
attribute values, comments, or CDATA declared sections.
You're better off using one of the HTML parsing modules.
Abigail
|
"You're better off using one of the HTML parsing modules."
No you're not.
<rant>
This mantra is getting very old. In this particular instance the OP is asking about removing <u> and </u>, not about HTML parsing in the general case. Let's look at the issues in turn:
- <u> may be present in attribute values. That's like <img src="foo.jpg" alt="<u>"> for those following on at home. Yes, this is legal HTML, but every web monkey I have spoken to about this, without exception, has always said "I didn't even know you could do that". So in Real Life this hardly ever arises.
- <u> may be present in comments. The comments may be there because part of the page has been temporarily commented out. But guess what? They probably want the replacements made in the comments, anyway!
- And as for CDATA sections, most people without an SGML background don't even know what they do, and don't have them in their HTML as a rule.
Wanting to strip out underline tags is therefore about as trivial as it gets. Plus in absence of evidence to the contrary, the person is in control of the HTML and has a pretty good idea of what's in there. In this case, there are no attribute values to worry about and the element name is only one character long. You probably don't even have to worry about the tag wrapping from one line to the next. It really doesn't get any easier than this.
An entire directory of files can be done with the following one-liner:
perl -i.bak -pe 's/<\/?u>//gi' *.html
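For anyone who would rather see what that one-liner does spelled out as a script, here is a rough equivalent. It is only a sketch: the scratch directory and the file name page.html are made up for the demonstration, and setting $^I plus @ARGV by hand is simply the long-hand form of the -i.bak and -p switches.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory and file, created only for this demonstration.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/page.html";
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "<p><u>one</u></p>\n<p><U>two</U></p>\n";
close $out;

{
    # Same effect as -i.bak: edit the files in @ARGV in place, keeping a backup.
    local $^I   = '.bak';
    local @ARGV = ($file);
    while (<>) {
        s/<\/?u>//gi;
        print;    # goes to the rewritten file, not the terminal
    }
}

# Read the edited file back to check the result.
open my $in, '<', $file or die "Cannot read $file: $!";
my $result = do { local $/; <$in> };
close $in;
print $result;
```

The untouched original ends up next to the edited file with a .bak suffix, which is your safety net if the regex ate something it should not have.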
Saying that one needs a parser to do this is just spreading FUD and making things seem more complicated than they need to be. Do simple jobs simply, and keep it simple, stupid. Save the parser approach for something hard, like converting a font-marked-up page into CSS.
</rant>
|
That's a fair cop. But here is another ... instead of just
showing how to strip <u> tags from an HTML document,
wouldn't (if CSS is a valid solution) it be better to show
how to override <u> tags instead?
u { text-decoration: none; }
I understand your rant grinder, but if you have been
following Tricky's problems ... you see that he just keeps
adding more to the list of items to be removed from his
HTML. Pretty soon his list of regexes is going to be hairy.
But then again, maybe this is better for new-comers
... you don't really appreciate a good thing until you
really need it. ;)
UPDATE:
From Tricky's home node:
I aim to complete my MSc project by Jan 2004, which involves
parsing HTML pages in a proxy server, stripping tags which
impede Web access for visually impaired people (including
dyslexics and colour blind), and returning to the client.
Yep, sounds like Tricky sure could use
HTML::Sanitizer or HTML::Scrubber ... but
i would be tempted to slay this dragon with
XML::LibXML and XML::LibXSLT myself ...
jeffa
|
"And as for CDATA sections, most people without an SGML background don't even know what they do, and
don't have them in their HTML as a rule."
You mean people who put SCRIPT or STYLE
elements in their HTML have an SGML background? Wow, SGML
is far more popular than I thought.
"Plus in absence of evidence to the contrary,
the person is in control of the HTML and has a pretty good idea of what's in there."
He may, but we don't. He didn't tell us what's in the HTML
file. It's easy to just assume things, but I can play that
game as well. Just assume there's no <u> present
and do nothing! Assuming things without stating what you
assume is pointless. Furthermore, the OP asks whether the
trivial regex is the best way, or if there's another way.
Hence my answer starts with "For general HTML files".
Besides, if the OP is really in control of what's in the
HTML files, the best answer is to not put in stuff in the
files that you don't want to have there.
Abigail
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<u style="text-decoration: underline overline;" title="> bob">bob</u>
</body>
</html>
Hope this helps.
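To make the point of that document concrete, here is a small sketch that runs the simple regex from earlier in the thread over its <u> element. The second, quote-aware pattern is my own illustration and not something anyone in the thread proposed; it is still only a regex, so comments and SCRIPT or STYLE content can fool it just the same.

```perl
use strict;
use warnings;

# The <u> element from the sample document above.
my $html = '<u style="text-decoration: underline overline;" title="> bob">bob</u>';

# The simple pattern only knows attribute-free tags, so the opening tag survives.
(my $simple = $html) =~ s/<\/?u>//gi;

# A quote-aware pattern (an illustration, not from the thread): after the tag
# name, skip quoted strings whole so a ">" inside an attribute value is ignored.
(my $aware = $html) =~ s{</?u\b(?:"[^"]*"|'[^']*'|[^'">])*>}{}gi;

print "simple: $simple\n";
print "aware:  $aware\n";
```

The simple version strips only the closing tag and leaves the page with an unclosed <u>, which is arguably worse than doing nothing.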
antirice
The first rule of Perl club is - use Perl
The ith rule of Perl club is - follow rule i - 1 for i > 1
|
But oh, for the sport of it.