CrysC has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on an HTML parser for a CMS project and need to delete empty HTML tags recursively (as in <p><i></i></p> would all disappear), except if the tag is <a name="..."> or <a id="...">.

I've got one that works but my code seems far too fragile for my taste.

while ($self->{'prcssed_txt'} =~ s/<([^<>]+)([\s]+[^<>]+)*>\s*<\/\1>\n?//) { }
This first attempt strips all empty tags.

while ($self->{'prcssed_txt'} =~ s/<([^<>a][^<>]*)([\s]+[^<>]+)*>\s*<\/\1>\n?//) { }
This one strips all empty tags that don't start with a.

while (($self->{'prcssed_txt'} =~ s/<([^<>a][^<>]*)([\s]+[^<>]+)*>\s*<\/\1>\n?//)
    || ($self->{'prcssed_txt'} =~ s/<a href[^<>]*>\s*<\/a>//)
    || ($self->{'prcssed_txt'} =~ s/<abbr[^<>]*>\s*<\/abbr>//)
    || ($self->{'prcssed_txt'} =~ s/<acronym[^<>]*>\s*<\/acronym>//)) { }

This one does what I want it to, but it's fragile -- if I change what order the attributes for an a tag are returned from my parser it breaks; if I add another tag that begins with a to the list of allowable tags, it breaks.
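
One way to take both the attribute order and the list of tags beginning with a out of the pattern is to match any empty element and decide in a callback whether to keep it. A minimal sketch, assuming the text is already well-formed XHTML; `keep_empty` and `strip_empty_tags` are hypothetical names, not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical rule: keep an empty <a> that carries a name or id attribute,
# wherever that attribute sits inside the tag.
sub keep_empty {
    my ($name, $attrs) = @_;
    return lc($name) eq 'a' && $attrs =~ /\s(?:name|id)\s*=/i;
}

# Delete empty elements pass after pass until a pass removes nothing,
# so parents emptied by an earlier pass disappear too.
sub strip_empty_tags {
    my ($html) = @_;
    my $removed = 1;
    while ($removed) {
        $removed = 0;
        $html =~ s{(<(\w+)((?:\s[^<>]*)?)>\s*</\2>\n?)}{
            my ($all, $name, $attrs) = ($1, $2, $3);
            keep_empty($name, $attrs) ? $all : do { $removed = 1; '' };
        }ge;
    }
    return $html;
}

print strip_empty_tags(qq{<p><i></i></p><p><a href="#" id="top"></a></p>\n});
```

Adding another allowed tag that begins with a needs no change here, and a new exception only touches `keep_empty`. The attribute test is still a regex, so a name= or id= that appears after whitespace inside another attribute's value would be misread.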

Re: regex: deleting empty (x)html tags
by Abigail-II (Bishop) on Feb 14, 2004 at 23:39 UTC
    Don't try to solve problems like this with regexes. While I won't claim it's impossible, it's certainly not easy, efficient or maintainable. Your regex is going to be long, and full of special constructs.

    Solve this using an HTML parser. Once you have a parse tree, removing empty elements is trivial.

    Abigail

      I've considered sending it back through HTML::PullParser a second time to remove the empty tags, but that seems like quite a bit of excess processing simply to remove empty tags.

      I could be wrong though -- just because it will take on the order of 20 times as much code as the regex doesn't mean it will actually be slower.

      Edit: This is a document fragment with no containing tag, so any of the tree-based parsers will barf, afaik.

      Edit2: I'm not refusing to consider non-regex solutions, it's just that loading yet another parser or going through HTML::PullParser again doesn't seem to be a very efficient way of doing it...
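
For a sense of how much code the tree route takes once the input is known to be well-formed, here is a minimal sketch in plain Perl with no extra modules. It is a toy stand-in for a real parser, not HTML::PullParser or any CPAN API, and the names `parse_fragment`, `prune`, and `serialize` are hypothetical. A synthetic root node holds the top-level nodes, so a fragment with no containing tag is not a problem.

```perl
use strict;
use warnings;

# Build a tree from well-formed XHTML. Elements are hashes with name,
# attrs (raw string), kids and a void flag; text nodes are plain strings.
sub parse_fragment {
    my ($html) = @_;
    my $root  = { name => '', attrs => '', kids => [] };    # synthetic root
    my @stack = ($root);
    while ($html =~ m{\G(?:<(/?)(\w+)([^<>]*?)(/?)>|([^<]+))}gc) {
        my ($close, $name, $attrs, $self_closed, $text) = ($1, $2, $3, $4, $5);
        if (defined $text) {
            push @{ $stack[-1]{kids} }, $text;
        }
        elsif ($close) {
            die "Bad HTML: unexpected </$name>\n"
                unless @stack > 1 && $stack[-1]{name} eq $name;
            pop @stack;
        }
        else {
            my $node = { name => $name, attrs => $attrs, kids => [],
                         void => $self_closed ? 1 : 0 };
            push @{ $stack[-1]{kids} }, $node;
            push @stack, $node unless $self_closed;
        }
    }
    die "Bad HTML: unparsed input or unclosed tag\n"
        if (pos($html) // 0) < length($html) || @stack > 1;
    return $root;
}

# Drop elements whose content is nothing or only whitespace, bottom-up,
# except <a> elements that carry a name or id attribute.
sub prune {
    my ($node) = @_;
    my @kept;
    for my $kid (@{ $node->{kids} }) {
        if (ref $kid) {
            prune($kid);
            my $is_anchor = $kid->{name} eq 'a'
                && $kid->{attrs} =~ /\s(?:name|id)\s*=/;
            my $is_empty = !$kid->{void}
                && !grep { ref $_ || /\S/ } @{ $kid->{kids} };
            next if $is_empty && !$is_anchor;
        }
        push @kept, $kid;
    }
    $node->{kids} = \@kept;
    return $node;
}

# Turn the tree back into markup.
sub serialize {
    my ($node) = @_;
    return $node unless ref $node;                          # text node
    my $inner = join '', map { serialize($_) } @{ $node->{kids} };
    return $inner if $node->{name} eq '';                   # synthetic root
    return "<$node->{name}$node->{attrs}/>" if $node->{void};
    return "<$node->{name}$node->{attrs}>$inner</$node->{name}>";
}

print serialize(prune(parse_fragment('<p><i></i></p><a name="top"></a>text'))), "\n";
```

The pruning itself is the dozen lines in `prune`; the rest is parsing and printing, which an existing parser pass could already be doing.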

Re: regex: deleting empty (x)html tags
by exussum0 (Vicar) on Feb 14, 2004 at 22:52 UTC
    HTML and XML are context-sensitive languages. A regular expression works on a regular language, where the order of the thing you are matching matters. In context-sensitive languages, what you are looking for may have a different meaning depending on the context in which you found it. Though Perl has souped-up regular expressions, they cannot describe everything, especially when order matters. Take the balancing of XML tags, for instance.

    A problem you haven't shown, but which occurs with context-sensitive languages: if (b)(i)(/b)(/i) is accepted as valid, the check should fail. I know, I know, it's not what you asked. But what matters is the context in which things are used in relation to everything else. You MAY be better off doing something like...

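    sub figureOut {
        my ($text) = @_;
        my @tags;    # stack of currently open tag names
        while ( $text =~ m{<(/?)(\w+)[^<>]*?(/?)>}g ) {
            my ( $closing, $name, $selfClosed ) = ( $1, $2, $3 );
            next if $selfClosed;    # <br /> and friends open nothing
            if ($closing) {
                my $matchTag = pop @tags;
                die 'Bad HTML' unless defined $matchTag && $matchTag eq $name;
            }
            else {
                push @tags, $name;
            }
        }
        die 'Bad HTML' if @tags;    # something was never closed
    }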
    I haven't run the code, but you get the idea. This program theoretically should figure out the balancing of tags, probably what is most fragile about your program. But somewhere in here, you should be able to do empty content.

    Anyway, regular expressions have limited scope in terms of context. They can tell if text has things in a certain order, but not if those things are in order depending on their context. Perl's re's can do it to some degree, but it's nowhere near complete... like tag balancing.

    Update: Use the power of english in the first paragraph.


    Play that funky music white boy..

      I should have specified -- the code here is already parsed into valid xHTML — therefore there are no inverted tags.

      In fact one of the major things I want to do here is clean out artifacts of the parsing process because for instance your <i><b> </i></b> would have become <i><b> </b></i><b></b> in the parsing process.

Re: regex: deleting empty (x)html tags
by jeffa (Bishop) on Feb 15, 2004 at 15:31 UTC

    So ... now that you have your question answered, let me ask one.

    Why do you need to delete these empty elements in the first place?!?

    I use a little tool called HTMLArea for CMS sites. This allows the CMS user to enter HTML as HTML by turning <textarea>'s into WYSIWYG editors.

    Just seems to me that you are trying to solve the symptom and not the problem, but i am glad you found your solution. :)

    Oh ... and by the way ... did you try HTML Tidy first? It strips those empty elements for you. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      In this case spec calls for an upload file function -- if the user was typing this into a text box, believe me I would use HTMLArea -- though given that it doesn't work on all browsers, unless you can honestly say that all the program's users all the time will have access to Moz or IE (and for instance my local public library, if you take a Mac you've only got Safari to work with), you really should have an alternate means of input anyway.

      HTMLTidy is also a case of not working with spec (though goddess it would make my life simpler), because it requires access to the command line to install. This is a low end CMS -- assuming the required modules exist or are installed for you by your host (and nothing called for by this is too outre) all you'd need to use this is ftp access and permission to run cgis.

        Besides the fact that you never answered why you need to strip empty elements, or that you don't even seem to care about where they come from, i would say that given your set of constraints that you are stuck with the solution you have discovered for yourself. Personally, i strive to work in an environment where i control such issues, but then again, i am not developing "web tools" for the general public to use. My audience for my free code is experienced Perl programmers (i write CPAN modules) and i make a living by working for clients with specific needs.

        If i were pushing this product out, i would look into bundling HTMLTidy along with the application, but that's just me. ;)

        Best of luck to you, and if we help you in the future (you solved this one yourself ... this time >:)) and you make some money as a result of it, please don't be shy. ;)

        jeffa

Re: regex: deleting empty (x)html tags
by CrysC (Novice) on Feb 15, 2004 at 03:16 UTC

    Well after playing with it awhile, I came up with this code:

    $self->{'prcssed_txt'} =~ s/<a(\s+[^<>]*(?:name|id)=[^<>]+>\s*<\/a>)/<<$1/g;
    while ($self->{'prcssed_txt'} =~ s/<([^<>]+)(\s+[^<>]+)*>\s*<\/\1>\n?//) { }
    $self->{'prcssed_txt'} =~ s/<</<a/g;

    It more or less treats a id & a name as a special case; munges them slightly so the empty tag stripper doesn't get them, and then unmunges them.

    Since they are a special case, I think this is a reasonable way of handling this, and it doesn't add the complexities of either doing this at the same time I'm parsing the html (not exactly a simple process, even without that, since the spec calls for parsing very broken html) or adding a whole second pass through a parser.
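
The protect, strip, restore sequence can be tried in isolation. This is a sketch under assumptions: it works on a plain string rather than `$self->{'prcssed_txt'}`, it writes the attribute test as the alternation `(?:name|id)`, and `strip_but_keep_anchors` is a hypothetical name. It also assumes `<<` never occurs in the input, because every `<<` is turned back into `<a` at the end.

```perl
use strict;
use warnings;

sub strip_but_keep_anchors {
    my ($txt) = @_;

    # 1. Munge: hide empty <a name=...> and <a id=...> as "<< ...></a>",
    #    which the stripper below cannot match.
    $txt =~ s/<a(\s+[^<>]*(?:name|id)=[^<>]+>\s*<\/a>)/<<$1/g;

    # 2. Strip one empty element per iteration until none is left.
    while ($txt =~ s/<([^<>]+)(\s+[^<>]+)*>\s*<\/\1>\n?//) { }

    # 3. Unmunge: restore the protected anchors.
    $txt =~ s/<</<a/g;

    return $txt;
}

# The parsing artifact mentioned earlier in the thread disappears entirely,
# while the named anchor survives.
print strip_but_keep_anchors('<a name="top"></a><i><b> </b></i><b></b>'), "\n";
```

Because step 1 only looks for name= or id= somewhere in the tag, the order in which the parser emits the attributes no longer matters.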