Re^3: The mostly used xml parser

A month ago I wouldn't have mentioned XML::Twig - XML/HTML::TreeBuilder was my hammer. Having seen XML::Twig mentioned so frequently, I've started using it too, and it does filtering (my first application for it) very nicely!

What I'd like would be an HTML::Twig. Might not be as "easy" to write though :).

Perl is Huffman encoded by design.

Re^4: The mostly used xml parser by mirod (Canon) on Oct 05, 2005 at 20:50 UTC
Processing HTML is not too difficult, as long as tidy is around: `perl -MXML::Twig -e'open( my $fh, "tidy -asxml -quiet pm.html 2>/dev/null| ") or die $!; XML::Twig->parse( $fh)'` Keeping the output similar to the input is of course much harder, as in this case XML::Twig does not see the original file. Here I pay for the fact that XML::Twig does not accept a SAX stream as input, or I could use XML::LibXML::SAX and get HTML parsing for free (SAX was quite new when I started writing XML::Twig, and XML::Twig is now very strongly coupled with XML::Parser).
Re^5: The mostly used xml parser by GrandFather (Saint) on Oct 05, 2005 at 21:13 UTC
Excellent point - using tidy as a preprocessor goes a long way toward bridging the gap. It is also interesting to note that there is a tidy.pm available, although the ActiveState ppm install of it doesn't manage to hook the documentation into ActiveState's Perl documentation :(. Perl is Huffman encoded by design.
Re^6: The mostly used xml parser by mirod (Canon) on Oct 06, 2005 at 08:02 UTC
It must be from HTML::Tidy, which is an HTML checker: it doesn't return the XHTML generated by `tidy`, just the error messages. I could provide a shortcut for using `tidy` (or `xmllint`, which can do the same thing) though, provided it is either in the path or you give the path to the executable. I'll add this to the next version, thanks for the idea.
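For illustration only, here is a rough sketch of how such a shortcut could be written outside the module. It is not XML::Twig's API: `find_converter` and `parse_html_with_tidy` are hypothetical names, and the sketch assumes a Unix-like system where `tidy` is installed under that name and list-form pipe opens are available.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of XML::Twig): return the converter to run.
# An explicit path wins; otherwise each directory in PATH is searched.
sub find_converter {
    my ( $name, $explicit ) = @_;

    if ( defined $explicit ) {
        return $explicit if -f $explicit && -x _;
        return;    # an explicit path was given but is not usable
    }

    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $name );
        return $candidate if -f $candidate && -x _;
    }
    return;        # not found anywhere in PATH
}

# Hypothetical wrapper: run tidy on an HTML file and parse its XHTML output.
# Any extra arguments are passed through to XML::Twig->new.
sub parse_html_with_tidy {
    my ( $file, $tidy_path, %twig_options ) = @_;

    my $tidy = find_converter( 'tidy', $tidy_path )
        or die "tidy not found; install it or pass the path to the executable\n";

    # Loaded here so find_converter can be used without XML::Twig installed.
    require XML::Twig;

    # List-form pipe open: no shell is involved, so $file needs no quoting.
    # Unlike the one-liner, tidy's warnings on STDERR are not silenced.
    open( my $fh, '-|', $tidy, '-asxml', '-quiet', $file )
        or die "cannot run $tidy: $!\n";

    my $twig = XML::Twig->new(%twig_options);
    $twig->parse($fh);

    # tidy exits non-zero when it only reports warnings,
    # so a failed close is not treated as fatal here.
    close $fh;

    return $twig;
}
```

With these in place, something like `my $twig = parse_html_with_tidy( 'pm.html', undef, pretty_print => 'indented' ); $twig->print;` would do roughly what the one-liner does, and passing a path as the second argument covers the case where `tidy` is not in the path.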