Re: regex: deleting empty (x)html tags

Well, after playing with it for a while, I came up with this code:

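    # Munge: hide empty <a name=...> / <a id=...> anchors by turning "<a" into "<<".
    # Note (?:name|id), not [name|id]: the latter is a character class.
    $self->{'prcssed_txt'} =~ s/<a(\s+[^<>]*(?:name|id)=[^<>]+>\s*<\/a>)/<<$1/g;
    # Strip empty tags repeatedly, so tags emptied by an earlier pass also go.
    while ($self->{'prcssed_txt'} =~ s/<([^<>]+)(\s+[^<>]+)*>\s*<\/\1>\n?//) { }
    # Unmunge: restore the protected anchors.
    $self->{'prcssed_txt'} =~ s/<</<a/g;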

It more or less treats empty <a id=...> and <a name=...> anchors as a special case: it munges them slightly so the empty-tag stripper doesn't get them, and then unmunges them afterwards.
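
For anyone who wants to try the munge/strip/unmunge idea outside the object, here is a minimal standalone sketch. The sub name strip_empty_tags and the sample input are made up for illustration. It also assumes two small tightenings that are not required by the approach: a \b before name/id, and a tag name that cannot contain whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): runs the three steps on a
# plain string instead of $self->{'prcssed_txt'}.
sub strip_empty_tags {
    my ($text) = @_;

    # 1. Munge: protect empty <a name=...> / <a id=...> anchors by
    #    turning the leading "<a" into "<<".
    $text =~ s/<a(\s+[^<>]*\b(?:name|id)=[^<>]+>\s*<\/a>)/<<$1/g;

    # 2. Strip: delete one empty element per iteration until none match,
    #    so elements emptied by an earlier deletion are removed too.
    1 while $text =~ s/<([^<>\s]+)(?:\s+[^<>]+)?>\s*<\/\1>\n?//;

    # 3. Unmunge: restore the protected anchors.
    $text =~ s/<</<a/g;

    return $text;
}

# Sample input (assumed): nested empties, a named anchor, and real content.
my $html = qq{<p><b></b></p>\n<a name="top"></a>\n<div><span> </span></div>\n<p>text</p>\n};
print strip_empty_tags($html);
```

With that input, the nested empty <p><b> and <div><span> pairs disappear, while the named anchor and the paragraph with text survive. An empty <a href=...> is still stripped, since only name and id are protected.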

Since they are a special case, I think this is a reasonable way of handling it. It avoids the complexity of either doing this while I'm parsing the HTML (not exactly a simple process even without that, since the spec calls for parsing very broken HTML) or adding a whole second pass through a parser.
