in reply to Re^2: HTML::Parser fun
in thread HTML::Parser fun

I've had a gander at XML::LibXML but cannot see how to code it to be real-world HTML tolerant (so I can test it and see how tolerant it is).

You can't, at least not in Perl. XML::LibXML is a wrapper around libxml2, which does both the XML and the HTML parsing, so libxml2 is what you would need to change.
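
If the goal is just to see how tolerant libxml2 is, here is a minimal sketch. It assumes a reasonably recent XML::LibXML that provides load_html and its recover option. The sample markup is made up for illustration. Note that this only switches on libxml2's own recovery mode; how the broken markup gets repaired is still decided inside libxml2.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample of sloppy HTML: unquoted attribute, unclosed <b> and <p>.
my $html = '<html><body><p class=intro>Hello <b>world</body></html>';

# recover => 2 asks libxml2 to carry on after errors without reporting them.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# Dump what libxml2 made of the input, to judge the repair by eye.
print $doc->toStringHTML;

# Query the recovered tree like any other document.
my @paras = $doc->findnodes('//p');
printf "%d <p> element(s) found\n", scalar @paras;
```

Feeding it a handful of real pages and comparing the dump with the input should give a fair idea of where the recovery holds up and where it does not.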

For the record, when I wanted to add HTML parsing to XML::Twig, I looked at HTML::Parser, XML::LibXML and tidy, and settled on HTML::Parser as the most robust and easiest-to-use way to get well-formed XML out of random HTML.
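
A rough sketch of that kind of conversion, assuming HTML::TreeBuilder (a separate module built on top of HTML::Parser) as the tree-building layer. This is an illustration with made-up input, not necessarily how XML::Twig does it internally.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample of sloppy HTML: no <html>/<body>, unclosed tags.
my $html = '<p class=intro>Hello <b>world<p>Second paragraph';

# HTML::TreeBuilder drives HTML::Parser and fills in the missing structure.
my $tree = HTML::TreeBuilder->new_from_content($html);

# as_XML serializes the tree with every element closed and attributes quoted.
my $xml = $tree->as_XML;

# Older HTML::TreeBuilder trees contain circular references; free them explicitly.
$tree->delete;

print $xml;
```

The resulting string can then be handed to an XML parser of your choice.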

Re^4: HTML::Parser fun
by FreakyGreenLeaky (Sexton) on Jun 05, 2008 at 15:11 UTC
    Yes, creamygoodness put me onto HTML::Parser some time ago, and I'm finding it hard to look back.

    That makes me wonder why Your Mother suggested that "There are options to allow more liberal/broken HTML to be parsed (or attempted anyway)."

    I wonder what options he/she was referring to?

    Any idea?