"There still isn't one single package that does XSLT 2.0"
There's XML::Saxon::XSLT2 (again, I'm the developer of it). It's a Perl wrapper around the Java Saxon library, using Inline::Java. It's a bit of a pain to install, and the interface between Java and Perl has the potential to be flaky, but right now it's your only option if you need XSLT 2.0 in Perl.
I'd love to see some competitors to it spring up, I really would. The only reason I wrote it is because there was literally no other choice in Perl for XSLT 2.0; not out of a love for Java programming. ;-)
"I do not want to have a war between the monks, but please enlighten me more on why to use HTML5 instead of TreeBuilder"
Two main reasons:
1. If you want to use XML::LibXML, which as I say is a very good DOM implementation (with XPath, XML Schema, Relax NG, etc.) then HTML::HTML5::Parser integrates with it out of the box.
2. It follows the parsing algorithm from the W3C HTML5 working drafts, allowing it to deal with tag soup in much the same way as desktop browsers do. (It currently passes the majority of the html5lib test suite. html5lib is an HTML parsing library for Python and Ruby, and is pretty much the de facto reference implementation of the HTML5 parsing algorithm.) If you wish to deal with random content off the Web, that's kinda important, because there are an awful lot more people who test their content in desktop browsers than test it in HTML::TreeBuilder.
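To illustrate the first point, here's a rough sketch of running XPath over the document that HTML::HTML5::Parser hands back. The sample markup and the `x` prefix are made up for illustration. The one thing to watch is that the parser puts elements in the XHTML namespace, so XPath expressions need a registered prefix.

```perl
use 5.010;
use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML::XPathContext;

# Illustrative tag soup; any HTML string would do here.
my $html = '<table><tr><td>Hello World</td></tr><p>Moved <em>out</em>.</p></table>';

# The result is an ordinary XML::LibXML::Document.
my $doc = HTML::HTML5::Parser->load_html(string => $html);

# Elements live in the XHTML namespace, so bind a prefix for XPath.
# The prefix name "x" is arbitrary.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

# Text of every <em> element, one per line.
say $_->textContent for $xpc->findnodes('//x:em');

# String value of the first <td>.
say $xpc->findvalue('//x:td');
```

From here on it's plain XML::LibXML, so whatever else you'd normally do with a DOM from that module should carry over.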
A practical example. Check out the following piece of HTML in a desktop web browser. Note that (somewhat counter-intuitively) the paragraph containing the emphasised text is rendered above the "Hello World" greeting.
<table>
<tr><td>Hello World</td></tr>
<p>This will be rendered <em>before</em> the greeting.</p>
</table>
Now run this test script:
use 5.010;
use HTML::TreeBuilder;
use HTML::HTML5::Parser;

my $string = do { local $/ = <DATA> }; # slurp

say "HTML::HTML5::Parser...";
say HTML::HTML5::Parser
-> load_html(string => $string)
-> textContent;

say "HTML::TreeBuilder...";
say HTML::TreeBuilder
-> new_from_content($string)
-> as_text;

__DATA__
<table>
<tr><td>Hello World</td></tr>
<p>This will be rendered <em>before</em> the greeting.</p>
</table>
Note that HTML::HTML5::Parser returns the content in the same order as your web browser; HTML::TreeBuilder does not.
That said, there are plenty of good things about HTML::TreeBuilder too; and if neither of the above applies to you, then it's a good option. It's stable, mature and well-understood by many Perl programmers. I don't really have anything bad to say about it.
In reply to Re^3: extracting data from HTML
by tobyink
in thread extracting data from HTML
by Jurassic Monk