tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: I solved the problem. Convert with HTML Tidy. See my comments below if you want a script out of the box.

******************
original post
******************
Monks, I have a question about keeping quotes in TreeBuilder-parsed HTML.

HTML::TreeBuilder strips the quotes around numerical attribute values, as can be seen in the example below, which includes the test I want passed.

According to my client, without quotes around all attribute values we have valid HTML but not valid XHTML. One workaround is changing the behavior of HTML::TreeBuilder, or perhaps the as_HTML method of HTML::Element. Another would be an HTML-to-XHTML converter, if anyone can recommend one.

I appreciate any leads for solving this.

    use strict;
    use warnings;
    use HTML::TreeBuilder;    # note the capital B in the module name
    use HTML::Element;
    use Test::More qw(no_plan);

    my $html = '<table> <tr> <td valign="top" width="10"></td> </tr> </table>';

    my $treeroot  = HTML::TreeBuilder->new;
    my $html_tree = $treeroot->parse($html);
    $treeroot->eof;           # signal end of input so the tree is finalized

    print $html_tree->as_HTML();
    # the quotes around 10 get stripped away.
    # (the quotes around top are kept.)

    my $wanted = q{<html><head></head><body><table><tr><td valign="top" width="10"></td></tr></table></body></html>};
    my $got    = $html_tree->as_HTML();
    is($got, $wanted);

*********************
UPDATE: script to do this with tidy, quoting from the how-to linked to above.

    my (@files, @rippers);
    @files = Filecontrol::get_files($dir);
    foreach my $file (@files) {
        `tidy -asxml -m $file`;
    }

"The spaces are important and so is every consonant. What does all that code mean? First, tidy identifies the program to use. -asxml instructs Tidy to convert the HTML document to XHTML. -m tells the program to modify the document in its current location, and c:\XHTML\tidy.htm is the location of the messy document to be converted."

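The loop in the update relies on Filecontrol::get_files, which is not shown in the post. Below is a minimal sketch of the same idea using only core Perl. The helper names tidy_command and tidy_dir, the .htm/.html filter, and taking the directory from the command line are assumptions for illustration, not part of the original script. It passes the arguments to tidy as a list rather than through the shell, so file names containing spaces should survive.

```perl
use strict;
use warnings;

# Hypothetical helper: build the tidy argument list for one file.
sub tidy_command {
    my ($file) = @_;
    return ('tidy', '-asxml', '-m', $file);
}

# Hypothetical helper: run tidy in place over every .htm/.html file in $dir.
# Returns the files for which tidy could not be run or reported errors.
sub tidy_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = map { "$dir/$_" } grep { /\.html?\z/i } readdir $dh;
    closedir $dh;

    my @failed;
    foreach my $file (@files) {
        system(tidy_command($file));
        # tidy exits with 1 for warnings and 2 for errors,
        # so only treat 2 and above (or a failed launch) as failure.
        my $exit = $? >> 8;
        push @failed, $file if $? == -1 || ($? & 127) || $exit >= 2;
    }
    return @failed;
}

# Only run when a directory is given on the command line.
if (@ARGV) {
    my @failed = tidy_dir($ARGV[0]);
    warn "tidy failed on: @failed\n" if @failed;
}
```

Since -m rewrites files in place, it is probably wise to try this on a copy of the directory first.
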
Replies are listed 'Best First'.
Re: Keep quotes around numerical attributes after parsing with HTML::Treebuilder?
by davorg (Chancellor) on Jul 19, 2005 at 09:58 UTC

    Sounds like the module is doing the right thing. Numeric attribute values don't need to be quoted in HTML. Perhaps you want the as_XML method instead.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Many thanks, I had overlooked that method. I'm unclear, though: is HTML in XML format the same as XHTML, or is the mapping messier? In other words, does as_XML output also count as XHTML? If so, I could use it rather than the tidy solution I explained in the update above.
        XHTML is HTML 4.01 conforming to the XML standards.
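
A quick way to try the as_XML suggestion against the original test markup is sketched below. It assumes HTML::TreeBuilder is installed from CPAN, and the variable names are illustrative. The output is XML-style markup with every attribute value quoted, but as far as I can tell as_XML does not add a DOCTYPE or namespace declaration, so the result may still need work before it validates as XHTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<table> <tr> <td valign="top" width="10"></td> </tr> </table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;                      # finish parsing before serializing

# as_XML quotes every attribute value, numeric or not.
my $xml_out = $tree->as_XML;
print $xml_out;

$tree->delete;                   # free the tree explicitly when done
```
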

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Keep quotes around numerical attributes after parsing with HTML::Treebuilder?
by dorward (Curate) on Jul 19, 2005 at 13:51 UTC

    If you look at the source to HTML::Element, this would appear to be handled by this line of code:

    if ($val !~ m/^[0-9]+$/s) { # quote anything not purely numeric

    Altering that will (for a value of "will" equivalent to five minutes poking at source code and not doing any testing whatsoever) allow you to quote any value. I might look into writing a patch and submitting it to the author at some stage.
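
To illustrate what that condition does, here is a small standalone sketch. It is not the HTML::Element source: format_attr and its $always_quote flag are hypothetical, and the escaping is deliberately simplified. It only shows how a "purely numeric" test decides whether a value gets quotes, and how bypassing the test quotes everything.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the attribute-formatting step.
# With $always_quote false, purely numeric values are left bare,
# mirroring the condition quoted above.
sub format_attr {
    my ($name, $val, $always_quote) = @_;
    if ($always_quote or $val !~ m/^[0-9]+$/s) {
        $val =~ s/&/&amp;/g;     # simplified escaping for the sketch
        $val =~ s/"/&quot;/g;
        $val = qq{"$val"};
    }
    return "$name=$val";
}

my @attrs = (['valign', 'top'], ['width', '10']);
for my $always_quote (0, 1) {
    print join(' ', map { format_attr(@$_, $always_quote) } @attrs), "\n";
}
# prints:
#   valign="top" width=10
#   valign="top" width="10"
```
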

      +++ :)