in reply to Re^3: The mostly used xml parser
in thread The mostly used xml parser

Processing HTML is not too difficult, as long as tidy is around:

perl -MXML::Twig -e'open( my $fh, "tidy -asxml -quiet pm.html 2>/dev/null| ") or die $!; XML::Twig->parse( $fh)'
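For anything longer than a one-liner, the same idea can be sketched as a small script. This is only an illustration under a few assumptions: tidy is in the path, the helper name twig_from_html is made up for this example (it is not part of XML::Twig), and printing the title is just a placeholder for real processing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not an XML::Twig method): run tidy on an HTML file
# and build a twig from the XHTML it writes to stdout.
# Assumes the tidy executable can be found in the path.
sub twig_from_html {
    my( $file, %twig_args) = @_;

    # List-form pipe open: no shell is involved, so the file name needs
    # no quoting. tidy's warnings still go to this script's stderr.
    open( my $fh, '-|', 'tidy', '-asxml', '-quiet', $file)
      or die "cannot run tidy: $!";

    my $twig = XML::Twig->new( %twig_args);
    $twig->parse( $fh);    # dies if tidy's output is not well-formed XML

    # tidy exits with 1 when it only issued warnings, so a false
    # return value from close is not treated as fatal here
    close $fh;

    return $twig;
}

# Placeholder processing: print the title of each file named on the command line
for my $file (@ARGV) {
    my $title = twig_from_html( $file)->first_elt( 'title');
    print "$file: ", ( $title ? $title->text : '(no title)'), "\n";
}
```

Saved as, say, html_title.pl, it would be called as: perl html_title.pl pm.html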

Keeping the output similar to the input is of course much harder, since in this case XML::Twig sees only tidy's output, not the original file.

Here I pay for the fact that XML::Twig does not accept a SAX stream as input; otherwise I could use XML::LibXML::SAX and get HTML parsing for free. SAX was quite new when I started writing XML::Twig, and by now XML::Twig is very tightly coupled to XML::Parser.

Re^5: The mostly used xml parser
by GrandFather (Saint) on Oct 05, 2005 at 21:13 UTC

    Excellent point - using tidy as a preprocessor goes a long way toward bridging the gap. It is also interesting to note that there is a tidy.pm available, although the ActiveState ppm install of it doesn't manage to hook the documentation into ActiveState's Perl documentation :(.


    Perl is Huffman encoded by design.

      It must be from HTML::Tidy, which is an HTML checker: it doesn't return the XHTML generated by tidy, just the error messages.

      I could provide a shortcut for using tidy (or xmllint, which can do the same thing) though, provided it is either in the path or you give the path to the executable. I'll add this to the next version, thanks for the idea.