jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have some HTML source content and I need to strip all the HTML tags except the superscript tag (preserving both the opening and the closing tag). I need a single regex to do it.

Replies are listed 'Best First'.
Re: Remove all html tag Except 'sup'
by moritz (Cardinal) on Jun 20, 2008 at 09:48 UTC
    A single regex is a bad way for the general case, but since you asked for it, I'll try:
    my $tag = qr{
        <(?>/?)     # tag start
        (?!sup)     # not a <sup> or </sup> tag
        [^>]*       # everything but the tag end
        >           # end of tag
    }xi;
    $str =~ s/$tag//g;

    This is untested and probably a bad idea, but you asked for it ;-)

    Update: fixed the regex to preserve the closing tag. Stupid me. It tried to match </sup, failed at the lookahead, backtracked so that /? matched nothing, and then matched that whole substring with the [^>]* rule. A non-backtracking group around /? prevents that. In Perl 5.10 you could also say /?+ instead.

      Hi, it's working fine, but it is not preserving the closing sup tag.
        It's missing grouping parens:
        # you need an extra (?:)
        $tag = qr{</?(?:(?!sup)[^>])*>}i;
        You're right, I updated my regex - should work now.
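
For anyone following along, here is a minimal sketch of the backtracking problem described in the update above. The sample string and the helper name strip_with are made up for illustration (they are not from the thread), and like the original regex this is only meant for simple, well-behaved markup.

```perl
use strict;
use warnings;

# First attempt: /? is allowed to backtrack and match nothing,
# so "</sup>" gets past the lookahead as "<" followed by "/sup".
my $naive = qr{</?(?!sup)[^>]*>}i;

# Fixed version: the atomic group (?>/?) cannot give the slash back.
my $atomic = qr{<(?>/?)(?!sup)[^>]*>}i;

# Hypothetical helper: return a copy of $html with all matches of $re removed.
sub strip_with {
    my ($re, $html) = @_;
    (my $out = $html) =~ s/$re//g;
    return $out;
}

# Assumed sample input, chosen only for this demonstration.
my $html = '<p>This is 21<sup>st</sup> <b>century</b></p>';

print strip_with($naive,  $html), "\n";   # This is 21<sup>st century
print strip_with($atomic, $html), "\n";   # This is 21<sup>st</sup> century
```

The first line of output loses the closing tag, the second keeps both, which matches what was reported in the replies above.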
Re: Remove all html tag Except 'sup'
by marto (Cardinal) on Jun 20, 2008 at 09:49 UTC
    Why do you 'need a single regex to do' this? What context will the HTML superscript tag have once you remove all of the other HTML tags from a document? Have you looked at any of the modules on CPAN designed to make working with HTML easier? There are modules with tried and tested code that deal properly with tagged documents; I would suggest you spend some time investigating this route. This sort of question is asked every so often, so Super Search should return some useful results.

    Martin
Re: Remove all html tag Except 'sup'
by apl (Monsignor) on Jun 20, 2008 at 09:46 UTC
    Why do you need a single Regex to do it?

    You might want to take a look at the CPAN HTML-Manipulator class. I haven't used it, so I can't swear by it.

Re: Remove all html tag Except 'sup'
by Your Mother (Archbishop) on Jun 20, 2008 at 16:38 UTC

    I second marto and others. Don't use regexes on HTML unless you know the HTML in question intimately and know regular expressions well. This lucky coincidence is rare in the wild. Here's a somewhat flexible example with HTML::TokeParser.

    use strict;
    use warnings;
    use HTML::TokeParser;

    my @tags = @ARGV;
    @tags || die "Give a list of tags to retain.\n";

    my %keep = map { lc($_) => 1, lc("/$_") => 1 } @tags;

    my $p = HTML::TokeParser->new(\*DATA);

    while ( my $t = $p->get_token ) {
        if ( $t->[0] =~ /S|E/ and $keep{$t->[1]} ) {
            print $t->[-1];
        }
        elsif ( $t->[0] eq 'T' ) {
            print $t->[1];
        }
    }

    __DATA__
    <div>
    <h1>Bang!<sup>1</sup></h1>
    <p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b> inside it</a>.</p>
    <a href="/top-level">naked link</a>
    <p><i>The</i> <b>content</b> of the body <sup>element</sup> is displayed in your <span>browser</span>.</p>
    </div>

    And because I have it lying around, here is the obverse -- a tag stripper -- with XML::LibXML.

    use warnings;
    use strict;
    use XML::LibXML;

    my @strip = @ARGV;
    @strip || die "Give a list of tags to strip.\n";

    my $parser = XML::LibXML->new();
    $parser->line_numbers(1);

    my $raw = join '', <DATA>;
    my $doc = $parser->parse_html_string($raw);
    my $root = $doc->documentElement();

    for my $strip ( @strip ) {
        for my $node ( $root->findnodes("//$strip") ) {
            my $fragment = $doc->createDocumentFragment();
            $fragment->appendChild($_) for $node->childNodes;
            $node->replaceNode($fragment);
        }
    }

    print $doc->serialize(1);

    __END__
    <div>
    <h1>Bang!<sup>1</sup></h1>
    <p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b> inside it</a>.</p>
    <a href="/top-level">naked link</a>
    <p><i>The</i> <b>content</b> of the body <sup>element</sup> is displayed in your <span>browser</span>.</p>
    </div>
Re: Remove all html tag Except 'sup'
by waldner (Beadle) on Jun 20, 2008 at 09:33 UTC
    Can you post an example? (input and expected output)
      'This is 21st'; please view as HTML. Output should be 'This is 21st'
Re: Remove all html tag Except 'sup'
by Jenda (Abbot) on Jun 21, 2008 at 10:40 UTC

    Let me see ... what do you expect to get from this?

    Some text. <script language="JavaScript"> function foo() { ... } </script> blah blah blah.

    Does the regexp handle this?

    blah <input type="text" name="foo" value="paul > martin">

    What about this:

    foo <!-- comment 10<start --> blah <sup>

    What about all kinds of different things you can find in HTML?
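
To make the questions above concrete, here is a rough sketch that feeds those three snippets to the regex posted earlier in this thread. The helper name strip_tags and the hash keys are made up for illustration; only the regex and the sample strings come from the thread.

```perl
use strict;
use warnings;

# The single-regex approach from earlier in the thread.
my $tag = qr{<(?>/?)(?!sup)[^>]*>}i;

# Hypothetical helper: return a copy of $html with every match removed.
sub strip_tags {
    my ($html) = @_;
    (my $out = $html) =~ s/$tag//g;
    return $out;
}

# The three problem inputs quoted above, labelled for the printout.
my %case = (
    script    => 'Some text. <script language="JavaScript"> function foo() { ... } </script> blah blah blah.',
    attribute => 'blah <input type="text" name="foo" value="paul > martin">',
    comment   => 'foo <!-- comment 10<start --> blah <sup>',
);

for my $name ( sort keys %case ) {
    printf "%-9s => %s\n", $name, strip_tags( $case{$name} );
}
```

In this sketch the script tags go but the JavaScript body stays behind as plain text, and the input tag is cut at the first > inside the attribute value, leaving part of the value in the output. The comment happens to be removed whole, but the unclosed <sup> is kept as-is.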

    Forget regexps ... a regexp that would strip everything you need stripped and keep everything you need kept would be insanely complex. And I would not trust it anyway. Use a module. E.g.

    use HTML::JFilter; # http://jenda.krynicky.cz/#HTML::JFilter

    my $filter = new HTML::JFilter <<'*END*';
    sup
    *END*

    $filteredHTML = $filter->doSTRING($enteredHTML);