Re^4: The mostly used xml parser

Processing HTML is not too difficult, as long as tidy is around:

perl -MXML::Twig -e'open( my $fh, "tidy -asxml -quiet pm.html 2>/dev/null| ") or die $!; XML::Twig->parse( $fh)'
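For anyone who would rather have this as a script, here is a minimal sketch of the same pipeline. It assumes a Unix-like system with `tidy` in the path, and it uses the list form of a piped `open` so that no shell is involved. The usage message and the `pretty_print` setting are my own choices for illustration, not anything the one-liner requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# The HTML file to convert is taken from the command line.
my $file = shift @ARGV or die "usage: $0 file.html\n";

# List-form pipe open: the command is run directly, without a shell,
# so file names containing spaces or quotes need no escaping.
# tidy's diagnostics still go to STDERR; redirect them when calling
# the script (2>/dev/null) to get the same effect as the one-liner.
open( my $fh, '-|', 'tidy', '-asxml', '-quiet', $file )
    or die "cannot run tidy: $!";

# Build the twig from tidy's XHTML output.
my $twig = XML::Twig->new( pretty_print => 'indented' );
$twig->parse( $fh );

# tidy exits with a non-zero status when it has issued warnings,
# so the return value of close is deliberately not treated as fatal.
close $fh;

$twig->print;
```

Run it as `perl tidy2twig.pl pm.html 2>/dev/null` (the script name is just an example).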

Keeping the output similar to the input is of course much harder, as in this case XML::Twig does not see the original file.

This is where I pay for the fact that XML::Twig does not accept a SAX stream as input: if it did, I could use XML::LibXML::SAX and get HTML parsing for free. (SAX was quite new when I started writing XML::Twig, and by now XML::Twig is very tightly coupled to XML::Parser.)

Re^5: The mostly used xml parser by GrandFather (Saint) on Oct 05, 2005 at 21:13 UTC
Excellent point - using tidy as a preprocessor goes a long way toward bridging the gap. It is also interesting to note that there is a tidy.pm available, although the ActiveState ppm install of it doesn't manage to hook the documentation into ActiveState's Perl documentation :(. Perl is Huffman encoded by design.
Re^6: The mostly used xml parser by mirod (Canon) on Oct 06, 2005 at 08:02 UTC
It must be from HTML::Tidy, which is an HTML checker: it doesn't return the XHTML generated by `tidy`, just the error messages. I could provide a shortcut for using `tidy` (or `xmllint`, which can do the same thing) though, provided it is either in the path or you give the path to the executable. I'll add this to the next version, thanks for the idea.
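
In the meantime, something along these lines can be done by hand. This is only a rough sketch of the idea, not the interface a future XML::Twig will have: `open_html_as_xml` and its `tool` and `path` options are hypothetical names made up for this example. It assumes a Unix-like system and that `xmllint` is called with its `--html --xmlout` options to emit XML from HTML input.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper, NOT part of XML::Twig: run the chosen converter
# on an HTML file and return a filehandle on its XML output.
sub open_html_as_xml {
    my( $file, %opt ) = @_;
    my $tool = $opt{tool} || 'tidy';    # 'tidy' or 'xmllint'
    my $exe  = $opt{path} || $tool;     # explicit path, else found via PATH

    my @cmd = $tool eq 'xmllint'
            ? ( $exe, '--html', '--xmlout', $file )
            : ( $exe, '-asxml', '-quiet', $file );

    # List-form pipe open, so the file name never goes through a shell.
    open( my $fh, '-|', @cmd ) or die "cannot run $exe: $!";
    return $fh;
}

my $twig = XML::Twig->new;
$twig->parse( open_html_as_xml( 'pm.html', tool => 'xmllint' ) );
$twig->print;
```

Passing `path => '/usr/local/bin/tidy'` (an example location) covers the case where the executable is not in the path.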