Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

From an HTML string, I would like to remove all HTML tags except the <font=xyz ...> and </font> tags. Any suggestions? Actually, as I am writing this, I realize that whatever expression you use for processing the CODE tag on this website would help me too. I could use a while loop that sets a flag when it detects the start tag and resets the flag when it reaches the end tag, but I was trying to see if there are any other ways.

Replies are listed 'Best First'.
Re: keeping one pattern but removing all else from a string
by stefp (Vicar) on Jul 13, 2002 at 03:10 UTC
Re: keeping one pattern but removing all else from a string
by Cody Pendant (Prior) on Jul 13, 2002 at 04:07 UTC
    You want to keep all the contents, but remove all tags, except for FONT tags?

    I do this the quick and dirty way

    Replace all <FONT> tags with a placeholder tag: {{{font}}} for instance.

    Then kill all the other tags.

    Then put the FONT tags back.

    my $tag = 'font';                               # tag to keep (assumed; the post never defines $tag)
    $html =~ s/<(${tag}[^>]*?)>/{{{$1}}}/sgi;       # temp-encode starting tags
    $html =~ s/<\/(${tag}[^>]*?)>/{{{\/$1}}}/sgi;   # temp-encode ending tags
    $html =~ s/<[^>]*?>//sgi;                       # kill all remaining tags
    $html =~ s/\{{3}/</sgi;                         # re-encode '<' of kept tags
    $html =~ s/\}{3}/>/sgi;                         # re-encode '>' of kept tags
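    The three steps above can also be wrapped in a small subroutine. This is only a sketch: the name strip_tags_except is made up for illustration, and it swaps the {{{ }}} placeholders for the control characters \x01 and \x02, on the assumption that the input never contains those bytes (so literal braces in the text survive).

```perl
use strict;
use warnings;

# Hypothetical helper: strip every tag except those named $tag.
sub strip_tags_except {
    my ($html, $tag) = @_;
    my $name = quotemeta $tag;
    # Assumption: the input never contains the sentinel bytes \x01 and \x02.
    $html =~ s{<(/?$name\b[^>]*)>}{\x01$1\x02}gi;   # protect the wanted tags
    $html =~ s{<[^>]*>}{}g;                          # drop every other tag
    $html =~ tr/\x01\x02/<>/;                        # restore the protected tags
    return $html;
}

print strip_tags_except('<p><font color="red">Hi</font> <b>there</b></p>', 'font'), "\n";
```

    Like any regex-based approach, this treats every <...> run as a tag, so a '>' inside an attribute value or an HTML comment will confuse it.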

    --
    ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
      Thank you. The solution by Cody Pendant above is what I was looking for, and it worked just as I wanted.
Re: keeping one pattern but removing all else from a string
by vladb (Vicar) on Jul 13, 2002 at 03:13 UTC
    Here's another (less appealing ... err.. worse) solution using raw regular expressions:

    Try this...
    # Tags in pairs like <font ...>content</font>
    my $font_tag_match = qr{
        < \s* font \b                           # opening tag name
        (?: \s+ [a-z:]+ \s* = \s*               # attribute name and equals sign
            (?: '[^']*' | "[^"]*" | [^\s>]+ )   # quoted or bare attribute value
        )*
        \s* >
        .*?                                     # content, matched non-greedily
        < \s* / \s* font \s* >                  # closing tag
    }xis;

    # remove all text that is not between a pair of <font> tags..
    $input_data = join '', $input_data =~ /$font_tag_match/g;
    Unfortunately I had to make up the code without any bit of testing as I don't have the means to do so on this particular PC (Have to reboot my Win2000 OS as the MS-DOS command prompt wouldn't work for some reason ;/)

    _____________________
    # Under Construction
      Thanks, but I need to keep the content outside the start and end tags, just without its HTML. The solution above preserves everything between the font tags but removes the content outside them completely. The way the CODE tag is implemented here is very similar to what I want: it removes all HTML outside the code but preserves it inside and shows it in a different font.
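      One way to get the CODE-tag-like behaviour described here might be to split the string on whole <font>...</font> blocks and strip tags only from the pieces in between. The following is a minimal sketch, not a tested production solution: strip_outside_font is a hypothetical name, and it assumes font elements are never nested and that no attribute value contains a '>'.

```perl
use strict;
use warnings;

# Hypothetical helper: remove tags outside <font>...</font> blocks,
# but leave each font block (including any markup inside it) untouched.
sub strip_outside_font {
    my ($html) = @_;
    # With a capturing group, split returns the separators too:
    # even indexes hold the text outside, odd indexes hold the font blocks.
    my @parts = split m{(<font\b[^>]*>.*?</font\s*>)}is, $html;
    for my $i (0 .. $#parts) {
        next if $i % 2;                # odd index: a captured font block, keep as is
        $parts[$i] =~ s/<[^>]*>//g;    # even index: outside text, strip its tags
    }
    return join '', @parts;
}

print strip_outside_font('x<b>y</b><font size=2><i>z</i></font><u>w</u>'), "\n";
```

      If the input can contain nested font tags, comments, or other irregular markup, a real parser such as the CPAN module HTML::Parser is likely a safer choice than regular expressions.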