in reply to Cleaning up HTML
This will churn through your HTML (either a complete page or a snippet), tidy it up, fix tag nesting, remove scripts, remove unknown attributes, etc.
Through the Rules => {} parameter, you can specify exactly what tags and attributes you want to allow through, adding regexes or callbacks to customise the results.
Clint
- fixing tag soup, like "<b><i>foo</b></i>"
Yes
- avoid inline elements wrapped around block elements, for example a "p" tag wrapped in "font" tags
Yes
- stripping (some) empty or whitespace-only elements, such as "<b></b>"
Yes with a callback such as:

    Rules => {
        # '*' must be quoted: => only auto-quotes plain identifiers
        '*' => sub {
            my ($filter, $element) = @_;
            # keep the tag only if its content has some non-whitespace
            return $element->{content} =~ m/\S/ ? 1 : 0;
        },
    }

- removing unnecessary tags, for example, if there's a "<font face="Verdana" size="1">" tag, strip it and its corresponding "</font>" tag, because that's my default font and size for the table - but leave in a font tag if there are any other attributes after dropping the ones with the default value: for example "<font face="Verdana" size="1" color="#FF0000">" -> "<font color="#FF0000">"
with a callback like:

    Rules => {
        font => sub {
            my ($filter, $element) = @_;
            my $attr = $element->{attr};
            # drop the attributes that only restate the defaults
            delete $attr->{size}
                if $attr->{size} && $attr->{size} eq 1;
            delete $attr->{face}
                if $attr->{face} && lc($attr->{face}) eq 'verdana';
            # keep the tag only if any attributes are left
            return keys %$attr ? 1 : 0;
        },
    }

- moving "<br>" tags out of links, when at the edge of the link text: <a href="linkto">link text<br></a> -> <a href="linkto">link text</a><br>
This is the only one I don't have a callback for. It would need to be handled with a regex run at the end. That said, it'd be pretty easy to change sub output_stack_entry to allow you to pass back the literal HTML to use instead of reassembling the components, for instance:

    Rules => {
        a => sub {
            my ($filter, $element) = @_;
            my $content = $element->{content};
            my ($pre, $post) = ('', '');
            if ($content =~ s{^\s*<br />}{}) { $pre  = '<br />'; }
            if ($content =~ s{<br />\s*$}{}) { $post = '<br />'; }
            if ($pre || $post) {
                # 'literal' is the proposed key; it only works once
                # output_stack_entry has been changed to honour it
                $element->{literal} = $pre
                    . '<a ' . $filter->_hss_join_attribs( $element->{attr} ) . '>'
                    . $content . '</a>'
                    . $post;
            }
            return 1;
        },
    }

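For the regex route, something like the following might do as a final pass over the filtered output. It is only a sketch: move_br_out_of_links is a made-up helper name, not part of HTML::StripScripts, and it assumes the filter has already normalised the markup (no nested links, no ">" inside attribute values).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::StripScripts): move <br> tags
# that sit at the very start or very end of a link's text to just
# outside the link. <br> tags in the middle of the text are left alone.
sub move_br_out_of_links {
    my ($html) = @_;

    my $br  = qr{<br\s*/?>}i;          # <br>, <br/> or <br />
    my $brs = qr{$br(?:\s*$br)*};      # one or more of them in a row

    # leading breaks:  <a ...><br>text  ->  <br><a ...>text
    $html =~ s{(<a\b[^>]*>)\s*($brs)\s*}{$2$1}gi;

    # trailing breaks: text<br></a>     ->  text</a><br>
    $html =~ s{\s*($brs)\s*(</a>)}{$2$1}gi;

    return $html;
}

print move_br_out_of_links('<a href="linkto">link text<br></a>'), "\n";
# prints: <a href="linkto">link text</a><br>
```

Whitespace between the break and the edge of the link is dropped along the way, which matches what the callback version does with its \s* patterns.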
Update: I'm the maintainer of HTML::StripScripts, and I added the Rules => {} parameter to it, which makes it easier to customise. But all credit for the underlying module must go to Nick Cleaton, the original author, who did a very, very good job indeed.
Update: Tidied up the HTML