Honestly, when I read the title of your node, HTML Tidy immediately sprang to mind, as it even has command-line switches specifically for cleaning up Office HTML. That website also has code showing how to call HTML Tidy from Perl, including some suggested error checking that seems mostly geared toward Unix. On second thought, it is not really clear to me why they use the code they do, so I'll post it here together with my replacement:
## This is what I think is needed beforehand :
open( TIDY, "html-tidy $commandline|")
    or die "Couldn't spawn html-tidy : $!\n";
my @output;
@output = <TIDY>;

## Here begins their code :
if (close(TIDY) == 0) {
    my $exitcode = $? >> 8;
    if ($exitcode == 1) {
        printf STDERR "tidy issued warning messages\n";
    } elsif ($exitcode == 2) {
        printf STDERR "tidy issued error messages\n";
    } else {
        die "tidy exited with code: $exitcode\n";
    }
} else {
    printf STDERR "tidy detected no errors\n";
}
I think this could simply be done with the following code, but I haven't checked all possible outcomes...
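my @output = qx(html-tidy $commandline);
# $? is -1 if the command could not be started at all
die "Couldn't run html-tidy : $!\n" if $? == -1;
my $exitcode = $? >> 8;
if ($exitcode == 0) {
    # without this branch a clean run would fall through to die()
    printf STDERR "tidy detected no errors\n";
} elsif ($exitcode == 1) {
    printf STDERR "tidy issued warning messages\n";
} elsif ($exitcode == 2) {
    printf STDERR "tidy issued error messages\n";
} else {
    die "tidy exited with code: $exitcode\n";
}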
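As a rough sketch of which outcomes there are to check, the status handling can be pulled into a small function that takes the raw value of $? and can be tried without having tidy installed. The name describe_tidy_status is made up for this example, and the meaning of exit codes 0, 1 and 2 is taken only from the snippet quoted above, not verified against the tidy documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of tidy or of any module): turn a raw
# wait status, i.e. the value of $? after qx// or close(), into a
# short description. Exit codes 0/1/2 follow the quoted snippet.
sub describe_tidy_status {
    my ($status) = @_;

    # -1 means the command could not be started at all
    return 'could not run tidy' if $status == -1;

    # the low 7 bits hold the signal number if the process was killed
    return 'tidy killed by signal ' . ($status & 127) if $status & 127;

    # the high byte holds the real exit code
    my $exitcode = $status >> 8;
    return 'tidy detected no errors'      if $exitcode == 0;
    return 'tidy issued warning messages' if $exitcode == 1;
    return 'tidy issued error messages'   if $exitcode == 2;
    return "tidy exited with code: $exitcode";
}

# Simulated wait statuses, so the sketch runs without tidy installed
print describe_tidy_status($_), "\n" for (0, 1 << 8, 2 << 8, 3 << 8, -1, 9);
```

In a real script you would call describe_tidy_status($?) right after the qx() line and decide yourself which of the results deserve a die.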

To wrap up: unless you give us a really convincing reason why html-tidy is not possible (and by "not possible" I mean that even putting html-tidy into a Perl script, writing it out to /tmp, starting it there and deleting the file again afterwards is ruled out), I'll stick with this solution :-)
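The write-out, run, delete idea could look roughly like the sketch below. It makes several assumptions: the payload here is a tiny stand-in Perl program rather than the real html-tidy binary (which you would have to embed yourself, for example in binary mode from a DATA section), and run_embedded is a made-up name. File::Temp is a core module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in payload (an assumption for this sketch): a tiny Perl program
# that prints one line and exits with code 1, like tidy with warnings.
my $payload = qq{print "tidied\\n"; exit 1;\n};

# Hypothetical helper: write the payload to a temporary file, run it,
# delete the file again, and return exit code, file name and output.
sub run_embedded {
    my ($code) = @_;

    # TMPDIR => 1 puts the file into the system temp dir (/tmp on most Unixes)
    my ($fh, $filename) = tempfile('tidyXXXXXX', TMPDIR => 1, UNLINK => 0);
    print {$fh} $code;
    close $fh or die "Couldn't write $filename : $!\n";

    # Run the file with the current perl; a real binary would need
    # chmod 0700 and would be started directly instead.
    my $cmd      = '"' . $^X . '" "' . $filename . '"';
    my @output   = qx($cmd);
    my $exitcode = $? >> 8;

    # Clean up again
    unlink $filename or warn "Couldn't delete $filename : $!\n";

    return ($exitcode, $filename, @output);
}

my ($exitcode, $filename, @output) = run_embedded($payload);
print "exit code $exitcode, output: @output";
```

Whether this is acceptable on a locked-down server is a separate question; the point is only that "I can't install anything" does not necessarily rule html-tidy out.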

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

In reply to Re: Converting Word97 (or later) exported HTML to valid HTML by Corion
in thread Converting Word97 (or later) exported HTML to valid HTML by projekt21
