What approach do you recommend for cleaning up snippets of HTML?

I'm currently using HTML::TableExtract to pull data (as HTML) out of HTML table cells. My problem is that the data contains a lot of cruft, typical leftovers from the WYSIWYG tools used to edit the HTML (incl. MS Word), and I'd like to clean it up. That includes:

I'm currently using a custom parser based on HTML::TokeParser::Simple, but

  1. The code is larger than I'd hoped for
  2. HTML::TableExtract is already based on an HTML parser (using HTML::Element, if I'm not mistaken), so this feels like I'm using too many similar yet different tools on the same project

What do you recommend? Can HTML::Element actually even manage tag soup, or does it require properly nested tags? How easy is it to remove or swap tag layers (to change the order of nesting)?
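To make the last question concrete, this is roughly the kind of tree-based edit I have in mind. It is only a minimal sketch, assuming HTML::TreeBuilder is used to build the HTML::Element tree; `strip_font_layers` is a made-up helper name, and I haven't checked how the tree builder copes with badly nested input.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of any module): drop the default font
# attributes, then unwrap any <font> element left without attributes.
sub strip_font_layers {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $font ($tree->look_down(_tag => 'font')) {
        # Passing undef as the value deletes the attribute
        $font->attr(face => undef) if ($font->attr('face') || '') eq 'Verdana';
        $font->attr(size => undef) if ($font->attr('size') || '') eq '1';

        # Skip <font> elements that still carry real attributes
        my @left = $font->all_external_attr_names;
        next if @left;

        # Remove the <font> layer but keep its children in place
        $font->replace_with_content;
        $font->delete;
    }

    # The tree builder wraps fragments in html/head/body, so the
    # result is returned with its <body> wrapper still around it.
    my $body = $tree->look_down(_tag => 'body');
    my $out  = $body->as_HTML('<>&', undef, {});
    $tree->delete;
    return $out;
}

print strip_font_layers(
    '<p><font face="Verdana" size="1">plain</font> '
  . '<font color="#0000ff" face="Verdana">blue</font></p>'
), "\n";
```

The first `<font>` should disappear entirely while the second keeps only its `color` attribute. Swapping the nesting order of two layers would need more than this.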

p.s. Here's the cleanup tool I wrote. It is not as complete as my wishlist.

use HTML::TokeParser::Simple;

sub dummy () {
    # empty token
    return HTML::TokeParser::Simple::Token::Text->new([ T => '' ]);
}

sub cleanup_html {
    my($html) = @_;
    my $p = HTML::TokeParser::Simple->new(string => $html);
    my @out;
    my @font;
    while(my $t = $p->get_token) {
        if($t->is_start_tag('font')) {
            if(($t->get_attr('face')||'') eq 'Verdana') {
                $t->delete_attr('face');
            }
            if(($t->get_attr('size')||'') eq '1') {
                $t->delete_attr('size');
            }
            if(%{$t->get_attr}) {
                push @font, 1;
            } else {
                push @font, 0;
                $t = dummy;
            }
        } elsif($t->is_end_tag('font')) {
            unless(pop @font) {
                $t = dummy;
            }
        }
        my @append = $t;
        if($t->is_tag('br')) {
            @append = ();
            while(my $T = pop @out) {
                if($T->is_start_tag and $t->get_tag ne 'p') {
                    unshift @append, $T;
                } else {
                    push @out, $T;
                    last;
                }
            }
            unshift @append, $t;
        } elsif($t->is_end_tag and $t->get_tag ne 'p') {
            my $tag = $t->get_tag;
            while(my $T = pop @out) {
                unshift @append, $T;
                if($T->is_text) {
                    last if $T->as_is =~ /\S/;
                } elsif($T->is_tag('br')) {
                    shift @append;
                    push @append, $T;
                } elsif($T->is_start_tag($tag)) {
                    @append = ();
                    last;
                } elsif($out[-1]->is_tag) {
                    last;
                }
            }
        }
        push @out, @append;
    }
    return join '', map $_->as_is, @out;
}

my $html = "<font color=\"#0000ff\" face=\"Verdana\" size=\"1\">\n</font>\n<p align=\"center\"><a href=\"#\"><font color=\"#0000ff\" face=\"Verdana\" size=\"1\">&euro; 750aa</font><br /></a></p>";
print cleanup_html($html);

Can you do better (smaller, more powerful, easier to extend, ...), or simply come up with something based on a tree of HTML::Element?

In reply to Cleaning up HTML by bart
