![]() |
|
Problems? Is your data what you think it is? | |
PerlMonks |
How do I remove HTML from a string?by faq_monk (Initiate) |
on Oct 08, 1999 at 00:32 UTC ( #758=perlfaq nodetype: print w/replies, xml ) | Need Help?? |
Current Perl documentation can be found at perldoc.perl.org. Here is our local, out-dated (pre-5.6) version: The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN (part of the libwww-perl distribution, which is a must-have module for all web hackers).
Many folks attempt a simple-minded regular expression approach, like
Here's one ``simple-minded'' approach, that works for most files:
#!/usr/bin/perl -p0777 s/<(?:[^>'"]*|(['"]).*?\1)*>//gs If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz . Here are some tricky cases that you should think about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]> If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out. <B>You can't see me!</B> -->
|
|