Unbalanced Tags

lzcd has asked for the wisdom of the Perl Monks concerning the following question:

Howdy,

I’m in the process of writing what could be considered a poor man’s version of Everything and have finally wandered around to the section dealing with submitted HTML stuff (a.k.a. node editing).

While I’m fairly sure I can fiddle around with modules such as HTML:: to sift out any tags outside of the relatively safe ones (e.g. br, hr, p, b, strong and & chars), I am unsure how to proceed with the whole issue of unmatched tags.

I know there are some HTML tricks that’ll allow the later series of browsers to ‘overlook’ such nasties as unbalanced tags but I would prefer to keep it safe and produce nice clean HTML 1.+ type code.

My current thinking on the subject is going along the lines of creating a small hash to hold a ‘level count’ for each tag, adding or subtracting from the count through the parse process and then dumping a series of close tags for any tags that still appear ‘open’.

This approach, AFAIK, will work fine for the simpler tags, where overlapping is okay, but I’m worried about what happens if I ever decide to progress to more complex tags, such as table handling, where the order of closing is important.
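A minimal sketch of the level-count idea described above, for illustration only. The name `close_by_count`, the `%ALLOWED` tag list, and the simple tag regex are all assumptions made here, not part of any module; a real filter would use a proper HTML parser.

```perl
use strict;
use warnings;

# Hypothetical whitelist of paired tags to track (an assumption for this sketch).
my %ALLOWED = map { $_ => 1 } qw(b strong i em);

# Count opens and closes per tag, then append close tags for whatever is left open.
sub close_by_count {
    my ($html) = @_;
    my %open;    # tag name => (opens - closes)
    my @seen;    # tag names in order of first appearance
    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($slash, $tag) = ($1, lc $2);
        next unless $ALLOWED{$tag};
        push @seen, $tag unless exists $open{$tag};
        $open{$tag} += $slash ? -1 : 1;
    }
    # Close in reverse order of first appearance; the counts alone know nothing more.
    my $tail = join '',
        map { "</$_>" x ($open{$_} > 0 ? $open{$_} : 0) } reverse @seen;
    return $html . $tail;
}

print close_by_count("<b>bold <i>both"), "\n";
print close_by_count("<b><i>x</b>"), "\n";
```

The second call shows the weakness raised above: the counts come out balanced once `</i>` is appended, but the result `<b><i>x</b></i>` still overlaps, because a per-tag counter keeps no record of nesting order.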

Is there a super dooper HTML::Parse->CloseAllDemTagsProperly call that I’ve missed?
In the process of producing the PM site and the like, has somebody refined a handstrung routine to the point where it does everything short of write the legal notice?

Thank you for your Infomercial time.

Replies are listed 'Best First'.
Re (tilly) 1: Unbalanced Tags by tilly (Archbishop) on Jan 19, 2001 at 07:15 UTC
Feel free to use the code I wrote at Why I like functional programming. There are even pointers there on how to extend it to have substantially more sophisticated functionality than Everything does. I would recommend a couple of minor changes though. Very specifically, in the scrub_input function at the bottom, the main loop should probably be:

    while ($raw =~ /\G([\w ]*)/g) {
      $scrubbed .= $1;
      my $pos = pos($raw);
      if ($raw =~ /\G($is_handled)/g) {
        $scrubbed .= $handler->{ lc($1) }->(\$raw);
        # See perlre.  If the handler matches something of
        # length 0, it won't match something of length 0
        # the next time through.  So...
        pos($raw) = pos($raw);
      }
      unless (pos($raw)) {
        if (length($raw) == $pos) {
          # EXIT HERE #
          return $scrubbed . $handler->{post}->(\$raw);
        }
        else {
          my $char = substr($raw, $pos, 1);
          pos($raw) = $pos + 1;
          $scrubbed .= &encode_entities($char);
        }
      }
    }

because at some point you are likely to want to have an optional handler for returns. Something like this:

    sub {
      my $t_ref = shift;
      if ($$t_ref =~ /\G( *)/g) {
        my $indent = $1;
        $indent =~ s/ /&nbsp;/g;
        return "<br>\n$indent";
      }
      else {
        confess("This shouldn't happen");
      }
    }

That handler will preserve indents properly. (I say optional because some users do not like that kind of autoformatting, so it should be user-configurable.)

Anyways, this node will probably make little sense to you without reading the other. Which may take a while to digest...
Re: Re (tilly) 1: Unbalanced Tags by lzcd (Pilgrim) on Jan 19, 2001 at 07:23 UTC
As always, a true master of the functional. :) My thanks.
Re: Unbalanced Tags by AgentM (Curate) on Jan 19, 2001 at 07:44 UTC
This sounds like the perfect job for a stack or LIFO. A similar exercise is parenthesis matching. Push tags that you find onto the stack. If you encounter a closing tag, pop the stack and compare what you popped to the current closing tag. Mismatched tags are bad HTML but rarely harmful. If, at the end, you have leftover tags, you have unclosed tags. Beware! Old standards allow this in a limited form, most notably for <p> tags among others. You'd need to read up on the standards to see which are acceptable and which are not.

AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
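The stack approach described in this reply could be sketched roughly as follows. This is only an illustration: `balance_tags`, the `%VOID` list, and the repair policy (auto-close anything opened inside a tag being closed, drop stray close tags) are assumptions made here, and optional-close tags such as <p> are not given any special treatment.

```perl
use strict;
use warnings;

# Tags that never take a close tag (an assumed, incomplete list).
my %VOID = map { $_ => 1 } qw(br hr);

# Rewrite $html so that every opened tag is closed in proper nesting order.
sub balance_tags {
    my ($html) = @_;
    my @stack;       # names of currently open tags, innermost last
    my $out = '';
    my $pos = 0;     # end of the previous tag in $html
    while ($html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g) {
        my ($slash, $tag) = ($1, lc $2);
        my $start = $-[0];
        $out .= substr($html, $pos, $start - $pos);    # text before this tag
        $pos = $+[0];
        my $text = substr($html, $start, $pos - $start);
        if ($VOID{$tag}) { $out .= $text unless $slash; next; }
        if (!$slash)     { push @stack, $tag; $out .= $text; next; }
        # Closing tag: if it matches something open, pop and close down to it.
        if (grep { $_ eq $tag } @stack) {
            while (@stack) {
                my $top = pop @stack;
                $out .= "</$top>";
                last if $top eq $tag;
            }
        }
        # A close tag with no matching open tag is silently dropped.
    }
    $out .= substr($html, $pos);            # trailing text
    $out .= "</$_>" for reverse @stack;     # close leftovers, innermost first
    return $out;
}

print balance_tags("<b>bold <i>both</b> tail"), "\n";
print balance_tags("<table><tr><td>x"), "\n";
```

Because the stack remembers nesting order, leftover table tags come out as `</td></tr></table>` rather than in arbitrary order, which is the case a plain per-tag counter cannot handle.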
Re: Re: Unbalanced Tags by lzcd (Pilgrim) on Jan 19, 2001 at 07:48 UTC
I was thinking about doing that also but my brain didn't get around to it until after the post. Doh. Oh for the lack of a brain, the code was lost...
Re: Unbalanced Tags by eg (Friar) on Jan 19, 2001 at 07:00 UTC
Well, since I'm such a bastard about this sort of thing, I'd probably just run the submitted HTML through a validator and complain loudly if it's not perfect. A less draconian measure, though, would be to pipe it through HTML Tidy or something. (I thought I saw something like this on CPAN, but I might have been dreaming. Or drunk. :)
