UPDATE: I solved the problem by converting with HTML Tidy. See my update below if you want a script out of the box.

******************
original post
******************
Monks, I have a question about keeping quotes in HTML parsed with HTML::TreeBuilder.

HTML::TreeBuilder's output drops the quotes around numeric attribute values, as the example below shows. The example includes the test I want to pass.

According to my client, output without quotes around every attribute value is valid HTML but not valid XHTML. One workaround would be to change the behavior of HTML::TreeBuilder, or perhaps of the as_HTML method of HTML::Element. Another would be an HTML-to-XHTML converter, if anyone can recommend one.

I appreciate any leads for solving this.

    use strict;
    use warnings;
    use HTML::TreeBuilder;    # module name is case-sensitive: capital B
    use HTML::Element;
    use Test::More qw(no_plan);

    my $html = '<table> <tr> <td valign="top" width="10"></td> </tr> </table>';

    my $treeroot  = HTML::TreeBuilder->new;
    my $html_tree = $treeroot->parse($html);
    $treeroot->eof;    # tell the parser the input is complete

    # The quotes around 10 get stripped away.
    # (The quotes around top are kept.)
    print $html_tree->as_HTML();

    my $wanted = q{<html><head></head><body><table><tr><td valign="top" width="10"></td></tr></table></body></html>};
    my $got    = $html_tree->as_HTML();
    is($got, $wanted);
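For anyone who would rather stay inside Perl: HTML::Element also has an as_XML method, which serializes the same tree with every attribute value quoted. The snippet below is only a minimal sketch of that idea, not a drop-in fix. Treating as_XML output as good enough for the client's XHTML requirement is an assumption, since it adds no DOCTYPE or namespace on its own.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<table> <tr> <td valign="top" width="10"></td> </tr> </table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;       # signal end of input so the tree is finalized

# as_XML is the XML-style serializer of HTML::Element;
# unlike as_HTML here, it quotes numeric attribute values too.
my $xml = $tree->as_XML;
print $xml;

$tree->delete;    # release the tree's circular references
```

If the result needs to be a complete XHTML document rather than a well-quoted fragment, an external converter is probably still the safer route.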
*********************
UPDATE: script to do this with tidy, quoting from the how-to linked to above.
*********************
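    # Filecontrol::get_files (not shown) returns the files in $dir.
    my @files = Filecontrol::get_files($dir);

    foreach my $file (@files) {
        # -asxml converts to XHTML; -m modifies the file in place.
        # The list form of system() avoids the shell, so odd file names are safe.
        system('tidy', '-asxml', '-m', $file);
    }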
"The spaces are important and so is every consonant. What does all that code mean? First, tidy identifies the program to use. -asxml instructs Tidy to convert the HTML document to XHTML. -m tells the program to modify the document in its current location, and c:\XHTML\tidy.htm is the location of the messy document to be converted."

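The flags described in that quote can also be wrapped so that tidy's result is checked for each file. This is a rough sketch under a few assumptions: tidy is on the PATH, the helper names tidy_command and describe_exit are made up for illustration, and the exit codes are read per tidy's documented convention (0 clean, 1 warnings, 2 errors).

```perl
use strict;
use warnings;

# Build the argument list for one file. Passing a list to system()
# bypasses the shell, so file names with spaces need no extra quoting.
sub tidy_command {
    my ($file) = @_;
    return ('tidy', '-asxml', '-m', $file);
}

# Map tidy's exit code to a short label.
sub describe_exit {
    my ($code) = @_;
    return $code == 0 ? 'ok'
         : $code == 1 ? 'warnings'
         : $code == 2 ? 'errors'
         :              'unknown';
}

# Convert every file named on the command line, in place.
for my $file (@ARGV) {
    system(tidy_command($file));
    if ($? == -1 or $? & 127) {
        warn "tidy did not run to completion for $file\n";
        next;
    }
    printf "%s: %s\n", $file, describe_exit($? >> 8);
}
```

Since -m overwrites the input, it is worth running this on a copy of the files first.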
In reply to Keep quotes around numerical attributes after parsing with HTML::Treebuilder? by tphyahoo
