How do I remove HTML from a string?


Come for the quick hacks, stay for the epiphanies.
	PerlMonks

How do I remove HTML from a string?

by faq_monk (Initiate)

on Oct 08, 1999 at 00:32 UTC ( [id://758]=perlfaq nodetype: print w/replies, xml )

Need Help??

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN (part of the libwww-perl distribution, which is a must-have module for all web hackers).

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comment may be present. Plus folks forget to convert entities, like < for example.

Here's one ``simple-minded'' approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif" 
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->

Domain Nodelet^?

www.com | www.net | www.org

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others musing on the Monastery: (6)

As of 2024-04-23 20:04 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found