HTML and XML are context-sensitive languages. A regular expression describes a regular language, where all that matters is the order of the things you are matching. In a context-sensitive language, what you are looking for may mean something different depending on where you found it. Though Perl has souped-up regular expressions, they cannot describe everything, especially when nesting order matters. Take the balancing of XML tags, for instance.

A problem you haven't shown, and one that comes up with context-sensitive languages: if your code treats (b)(i)(/b)(/i) as valid, it fails, because the tags overlap instead of nesting. I know, I know, it's not what you asked. But what matters is the context, that is, how each thing is used in relation to everything else. You MAY be better off doing something like...

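    sub figureOut {
        my ($text) = @_;
        my @tags;    # stack of open tag names
        # consume the next tag, opening or closing
        while ( $text =~ s/<(\/?)(\w+)[^>]*>// ) {
            my ( $closing, $name ) = ( $1, $2 );
            if ($closing) {
                my $matchTag = pop @tags;
                die 'Bad HTML' if !defined $matchTag or $name ne $matchTag;
            }
            else {
                push @tags, $name;
            }
        }
        die 'Bad HTML' if @tags;    # something was never closed
    }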
I haven't run the code, but you get the idea. In theory it checks that the tags balance, which is probably the most fragile part of your program. Somewhere in that loop you should also be able to handle the empty-content case.

Anyway, regular expressions have limited scope in terms of context. They can tell whether text has things in a certain order, but not whether that order is right given the surrounding context. Perl's regexes can do it to some degree, but nowhere near completely; tag balancing is one example.
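
To make that difference concrete, here is a minimal sketch. The names is_balanced and looks_ordered are made up for illustration, and it assumes simple markup with no comments, void elements such as br, or self-closing tags. It compares an order-only regex with a stack-based check on properly nested and on overlapping tags.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if every closing tag matches
# the most recently opened tag, and nothing is left open.
sub is_balanced {
    my ($text) = @_;
    my @open;    # stack of currently open tag names
    while ( $text =~ /<(\/?)(\w+)[^>]*>/g ) {
        my ( $closing, $name ) = ( $1, $2 );
        if ($closing) {
            my $expected = pop @open;
            return 0 unless defined $expected && $expected eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? 0 : 1;
}

# Hypothetical order-only check: some opening tag, later some closing tag.
sub looks_ordered {
    my ($text) = @_;
    return $text =~ /<\w+>.*<\/\w+>/s ? 1 : 0;
}

for my $sample ( '<b><i>x</i></b>', '<b><i>x</b></i>' ) {
    printf "%-18s ordered=%d balanced=%d\n",
        $sample, looks_ordered($sample), is_balanced($sample);
}
```

The order-only regex is satisfied by both samples, while the stack rejects the overlapping one. That is the kind of context a plain pattern does not carry around with it.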

Update: Used the power of English in the first paragraph.


Play that funky music white boy..

In reply to Re: regex: deleting empty (x)html tags by exussum0
in thread regex: deleting empty (x)html tags by CrysC
