Remove all html tag Except 'sup'

jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Remove all html tag Except 'sup'
by moritz (Cardinal) on Jun 20, 2008 at 09:48 UTC

my $tag = qr{
   <(?>/?)  # tag start
   (?!sup)  # not a <sup> or </sup> tag
   [^>]*    # anything except the tag end
   >        # end of tag
}xi;
$str =~ s/$tag//g;

This is untested and probably a bad idea, but you asked for it ;-)

Update: fixed the regex to preserve the closing tag. Stupid me. The old version tried to match /sup, failed on the lookahead, backtracked so that /? matched nothing, and then matched that whole substring with the [^>]* rule. A non-backtracking group around /? prevents that. In Perl 5.10 you could also write /?+ instead.
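The difference described in the update is easy to see by running both patterns on a small sample. This is only a sketch: the sample string and the variable names are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Illustrative input, not taken from the thread.
my $html = '<p>21<sup>st</sup> <b>century</b></p>';

# Without the atomic group, /? can backtrack to match nothing, so the
# lookahead is tested against "/sup" rather than "sup" and succeeds.
my $naive = qr{</?(?!sup)[^>]*>}i;

# With (?>...), the optional slash cannot be given back once it has matched.
my $atomic = qr{<(?>/?)(?!sup)[^>]*>}i;

(my $naive_out  = $html) =~ s/$naive//g;
(my $atomic_out = $html) =~ s/$atomic//g;

print "naive:  $naive_out\n";
print "atomic: $atomic_out\n";
```

The first line printed has lost its </sup>, while the second keeps both halves of the sup element.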

Re^2: Remove all html tag Except 'sup'

by jai_dgl (Beadle) on Jun 20, 2008 at 10:09 UTC

Hi, it's working fine, but it's not preserving the closing </sup> tag.

Re^3: Remove all html tag Except 'sup'

by Anonymous Monk on Jun 20, 2008 at 10:52 UTC

# you need an extra (?:...) group so the lookahead is checked before every character
$tag = qr{</?(?:(?!sup)[^>])*>}i;
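One thing worth knowing about this variant: because the lookahead is applied at every position inside the tag, it leaves alone any tag that contains "sup" anywhere, not just sup elements. A small sketch with a made-up input string shows both effects.

```perl
use strict;
use warnings;

# The pattern from the post above, unchanged.
my $tag = qr{</?(?:(?!sup)[^>])*>}i;

# Illustrative input: a sup element plus a link whose URL contains "sup".
my $html = '<p>21<sup>st</sup> <a href="/support">help</a></p>';

(my $out = $html) =~ s/$tag//g;
print "$out\n";
```

Both <sup> and </sup> survive, but so does the opening <a href="/support"> tag, while its closing </a> is stripped.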

Re^4: Remove all html tag Except 'sup'

by waldner (Beadle) on Jun 20, 2008 at 11:51 UTC

Re^3: Remove all html tag Except 'sup'

by moritz (Cardinal) on Jun 20, 2008 at 10:44 UTC

You're right, I updated my regex - should work now.

Re^4: Remove all html tag Except 'sup'

by Anonymous Monk on Jun 20, 2008 at 10:59 UTC

Re: Remove all html tag Except 'sup'
by marto (Cardinal) on Jun 20, 2008 at 09:49 UTC

super search

Re: Remove all html tag Except 'sup'
by apl (Monsignor) on Jun 20, 2008 at 09:46 UTC

You might want to take a look at the CPAN HTML-Manipulator class. I haven't used it, so I can't swear by it.

Re: Remove all html tag Except 'sup'
by Your Mother (Archbishop) on Jun 20, 2008 at 16:38 UTC

I second marto and others. Don't use regexes on HTML unless you know the HTML in question intimately and know regular expressions well. This lucky coincidence is rare in the wild. Here's a somewhat flexible example with HTML::TokeParser.

use strict;
use warnings;
use HTML::TokeParser;

my @tags = @ARGV;
@tags || die "Give a list of tags to retain.\n";
my %keep = map { lc($_) => 1, lc("/$_") => 1 } @tags;

my $p = HTML::TokeParser->new(\*DATA);

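while ( my $t = $p->get_token )
{
    # Start ("S") and end ("E") tokens carry the tag name in slot 1
    # and the original tag text in the last slot.
    if ( $t->[0] =~ /^[SE]$/ and $keep{$t->[1]} )
    {
        print $t->[-1];
    }
    # Text ("T") tokens are passed through; everything else is dropped.
    elsif ( $t->[0] eq 'T' )
    {
        print $t->[1];
    }
}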

__DATA__
<div>
    <h1>Bang!<sup>1</sup></h1>
    <p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b>
       inside it</a>.</p>
 <a href="/top-level">naked link</a>
    <p><i>The</i> <b>content</b> of the body <sup>element</sup> is
       displayed in your <span>browser</span>.</p>
</div>
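Since the original question works on a string rather than a file, the same idea can be wrapped in a function. This is a minimal sketch, assuming HTML::TokeParser is installed; the name strip_tags_except and the sample string are hypothetical and only for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return $html with every tag removed except those in @keep.
sub strip_tags_except {
    my ($html, @keep) = @_;
    my %keep = map { lc($_) => 1 } @keep;

    # A reference to a scalar makes HTML::TokeParser parse the string itself.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";

    my $out = '';
    while ( my $t = $p->get_token ) {
        my $type = $t->[0];
        if ( ($type eq 'S' or $type eq 'E') and $keep{ $t->[1] } ) {
            $out .= $t->[-1];    # original source text of the tag
        }
        elsif ( $type eq 'T' ) {
            $out .= $t->[1];     # text between tags, entities left undecoded
        }
    }
    return $out;
}

my $result = strip_tags_except('<p>This is the <b>21<sup>st</sup></b> century.</p>', 'sup');
print "$result\n";
```

Comments, declarations and processing instructions are dropped along with the unwanted tags, which is usually what you want when stripping.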

And because I have it lying around, here is the obverse -- a tag stripper -- with XML::LibXML.

use warnings;
use strict;
use XML::LibXML;

my @strip = @ARGV;
@strip || die "Give a list of tags to strip.\n";

my $parser = XML::LibXML->new();
$parser->line_numbers(1);

my $raw = join '', <DATA>;
my $doc = $parser->parse_html_string($raw);

my $root = $doc->documentElement();

for my $strip ( @strip )
{
    for my $node ( $root->findnodes("//$strip") )
    {
        my $fragment = $doc->createDocumentFragment();
        $fragment->appendChild($_) for $node->childNodes;
        $node->replaceNode($fragment);        
    }
}

print $doc->serialize(1);

__END__
<div>
    <h1>Bang!<sup>1</sup></h1>
    <p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b>
       inside it</a>.</p>
 <a href="/top-level">naked link</a>
    <p><i>The</i> <b>content</b> of the body <sup>element</sup> is
       displayed in your <span>browser</span>.</p>
</div>

Re: Remove all html tag Except 'sup'
by waldner (Beadle) on Jun 20, 2008 at 09:33 UTC

Can you post an example? (input and expected output)

Re^2: Remove all html tag Except 'sup'

by jai_dgl (Beadle) on Jun 20, 2008 at 09:51 UTC

This is

21^st

^st

Re^3: Remove all html tag Except 'sup'

by moritz (Cardinal) on Jun 20, 2008 at 10:02 UTC

Please view this as HTML output.

Writeup Formatting Tips

Re: Remove all html tag Except 'sup'
by Jenda (Abbot) on Jun 21, 2008 at 10:40 UTC

Let me see ... what do you expect to get from this?

Some text.
<script language="JavaScript">
function foo() {
   ...
}
</script>
blah blah blah.

blah <input type="text" name="foo" value="paul > martin">

foo  blah <sup>
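To make the point concrete, here is a rough sketch that runs the atomic-group regex posted earlier in this thread over the first two snippets. The variable names are mine, and the "..." in the script body is kept as literal text.

```perl
use strict;
use warnings;

# The regex from earlier in the thread, written on one line.
my $tag = qr{<(?>/?)(?!sup)[^>]*>}i;

my $script_sample = <<'HTML';
Some text.
<script language="JavaScript">
function foo() {
   ...
}
</script>
blah blah blah.
HTML

my $input_sample = q{blah <input type="text" name="foo" value="paul > martin">};

my @stripped;
for my $sample ( $script_sample, $input_sample ) {
    (my $copy = $sample) =~ s/$tag//g;    # strip on a copy, keep the original
    push @stripped, $copy;
}

print "$_\n---\n" for @stripped;
```

The script tags go away but the JavaScript source stays in the text, and the input tag is cut at the first > inside the attribute value, leaving martin"> behind.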

Forget regexps ... a regexp that would strip everything you need stripped and keep everything you need kept would be insanely complex, and I would not trust it anyway. Use a module, e.g.:

use HTML::JFilter; #http://jenda.krynicky.cz/#HTML::JFilter
my $filter = new HTML::JFilter <<'*END*';
sup
*END*

$filteredHTML = $filter->doSTRING($enteredHTML);

Jenda
Support Denmark!
Defend the free world!
