Re: regex: deleting empty (x)html tags

HTML and XML are context-sensitive languages, not regular ones. A regular expression works on a regular language, where all that matters is the order of the things you are matching. In a context-sensitive language, what you are looking for may mean something different depending on where you found it. Though Perl has souped-up regular expressions, they cannot describe everything, especially when nesting matters. Take the balancing of XML tags, for instance.

A problem you haven't shown, and one that comes up with context-sensitive languages, is rejecting input like (b)(i)(/b)(/i), where the tags are closed in the wrong order. I know, I know, it's not what you asked. But what matters is the context in which things are used in relation to everything else. You MAY be better off doing something like...


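use strict;
use warnings;

# Dies on the first closing tag that does not match the most recently opened tag.
sub figureOut
{
    my ($text) = @_;
    my @tags;                                   # stack of tag names still open
    while( $text =~ m{<(/?)(\w+)[^>]*?(/?)>}g )
    {
       my ($closing, $tag, $selfClosed) = ($1, $2, $3);
       next if $selfClosed;                     # <br /> opens and closes nothing
       if( $closing )
       {
          my $matchTag = pop(@tags);
          die('Bad HTML') if( !defined $matchTag || $tag ne $matchTag );
       }
       else
       {
          push(@tags, $tag);                    # the stack replaces the recursion
       }
    }
    die('Bad HTML') if( @tags );                # something was never closed
}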

I haven't run the code, but you get the idea. In theory it checks that the tags balance, which is probably the most fragile part of your program. Somewhere in that loop you should also be able to handle empty content.
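To make the "empty content" part concrete, here is a rough sketch of how the same stack idea could drop empty elements. It is not from the original post: the name strip_empty, the frame layout, and the decision to keep an empty element's whitespace are all my assumptions, and it assumes well-formed input with no ">" inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: remove elements that contain nothing but whitespace.
sub strip_empty {
    my ($text) = @_;
    my @stack;       # one frame per element that is still open
    my $out = '';    # finished top-level output

    # Append a chunk to the innermost open element, or to the output.
    my $emit = sub {
        my ($chunk) = @_;
        if (@stack) { $stack[-1]{body} .= $chunk }
        else        { $out .= $chunk }
    };

    # split with a capture group returns text and tags alternately.
    for my $tok ( split /(<[^>]+>)/, $text ) {
        if ( my ($close) = $tok =~ m{^</([\w:-]+)\s*>$} ) {
            my $frame = pop @stack;
            die "Bad HTML\n" if !$frame || $frame->{name} ne $close;
            # Keep the element only if something non-blank is inside it;
            # otherwise keep just its whitespace so words do not run together.
            $emit->( $frame->{body} =~ /\S/
                   ? $frame->{open} . $frame->{body} . $tok
                   : $frame->{body} );
        }
        elsif ( my ($open) = $tok =~ m{^<([\w:-]+)[^>]*(?<!/)>$} ) {
            # Opening tag (the lookbehind excludes self-closing tags like <br />).
            push @stack, { name => $open, open => $tok, body => '' };
        }
        else {
            $emit->($tok);    # text, self-closing tags, comments
        }
    }
    die "Bad HTML\n" if @stack;    # something was never closed
    return $out;
}

print strip_empty('<p>keep<b></b> this</p>'), "\n";
```

Because an element is only judged when its closing tag arrives, an outer element whose children were all dropped is itself seen as empty and dropped in the same pass.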

Anyway, regular expressions have limited scope in terms of context. They can tell whether text has things in a certain order, but not whether those things are in the right order for their context. Perl's regexes can do it to some degree, but they are nowhere near complete for things like tag balancing.
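For what "to some degree" can look like, here is a sketch using recursive patterns. It assumes Perl 5.10 or later (for named captures, (?&name) recursion, and possessive quantifiers), and it ignores comments, CDATA, and ">" inside attribute values. The variable name $balanced is mine.

```perl
use strict;
use warnings;
use 5.010;    # recursive patterns and named captures need Perl 5.10+

# Matches only if every opening tag is closed by a tag of the same name,
# in the right order.
my $balanced = qr{
    \A
    (?<node>
        [^<>]++                                   # plain text
      | < [\w:-]++ [^<>]* / >                     # self-closing tag such as <br />
      | < (?<name> [\w:-]++ ) [^<>]* (?<!/) >     # opening tag, then
        (?&node)*                                 # any number of child nodes, then
        </ \k<name> \s* >                         # the same name closing it
    )*
    \z
}x;

for my $doc ( '<p><i>ok</i><br /></p>', '<b><i>x</b></i>' ) {
    say "$doc: ", ( $doc =~ $balanced ? 'balanced' : 'not balanced' );
}
```

This only answers yes or no. It does not tell you where the mismatch is, which is one reason an explicit stack or a real parser is usually the better tool.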

Update: Use the power of English in the first paragraph.

Play that funky music white boy..


Re: Re: regex: deleting empty (x)html tags by CrysC (Novice) on Feb 14, 2004 at 23:04 UTC
I should have specified: the code here is already parsed into valid xHTML, so there are no inverted tags. In fact, one of the major things I want to do here is clean out artifacts of the parsing process because, for instance, your <i><b> </i></b> would have become <i><b> </b></i><b></b> in the parsing process.