Empty tags cleaner regex (for pre-validated XHTML)

Your Mother has asked for the wisdom of the Perl Monks concerning the following question:

The precondition for this code is that all input has already been validated and cleaned by an HTML::TokeParser filter, so the only tags it will see are well-formed, correctly nested, and free of tricky stuff like '"\>"'.

Given that, is this a sound approach to removing empty tags? Better ideas? I know it could be done within the TokeParser routines but those are already a bit complex with two big named loops and I'd rather do it the easy way, if it's reasonable, than add in more logic or resort to unget_token() back and forth.

1 while $body =~ s,<(\w+)[^>]*>\s*</\1\s*>,,g;
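To make the behaviour of the one-liner concrete, here is a minimal sketch that wraps it in a sub and runs it on two small inputs. The name strip_empty_tags and the sample strings are hypothetical, chosen only for illustration. The \b after (\w+) is an addition, not part of the posted pattern: without it, (\w+) can backtrack to "b" in "<br /></b>", read "r /" as attributes, and delete a perfectly valid "<br /></b>" sequence.

```perl
use strict;
use warnings;

# strip_empty_tags() is a hypothetical helper name, not from the post.
sub strip_empty_tags {
    my ($body) = @_;

    # A single s///g pass only removes the innermost empty pairs.
    # Removing them can leave their parents empty, so "1 while"
    # repeats the substitution until a pass changes nothing.
    #
    # \b (added here) pins the captured name to the whole tag name,
    # so "<br />" followed by "</b>" is not mistaken for an empty <b>.
    1 while $body =~ s,<(\w+)\b[^>]*>\s*</\1\s*>,,g;

    return $body;
}

# Nested empties: <b> </b> and <i></i> go first, then the emptied <p></p>.
print strip_empty_tags('<div><p><b> </b></p><p>Hi<br /><i></i></p></div>'), "\n";

# A <br /> directly before </b> must survive untouched.
print strip_empty_tags('<b>bold<br /></b>'), "\n";
```

The first call prints <div><p>Hi<br /></p></div> and the second leaves its input unchanged. Note that the pattern still treats elements such as an empty <textarea></textarea> as removable, so it is only reasonable when, as stated above, the set of allowed tags is already restricted upstream.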

Re: Empty tags cleaner regex (for pre-validated XHTML) by GrandFather (Saint) on Dec 13, 2005 at 03:58 UTC
So you want to remove empty paragraph elements that provide vertical formatting, empty table cells, and so on? If your HTML is clean (passes through HTML::Lint without errors) you may be able to use XML::Twig for the processing. Alternatively, HTML::TreeBuilder may give more mileage than TokeParser if you are doing a lot of editing. DWIM is Perl's answer to Gödel
Re^2: Empty tags cleaner regex (for pre-validated XHTML) by Your Mother (Archbishop) on Dec 13, 2005 at 18:39 UTC
I've never really used TreeBuilder, apart from once, months ago, for a test. Thanks for suggesting it. It might let me do everything at once. In this case, it's for user-provided input; no tables will be allowed and, typographically speaking, empty paragraph tags are an invalid way to achieve formatting. That should be done with CSS or, at worst, <br />s.