in reply to Re: extracting data from HTML
in thread extracting data from HTML

it's alright to be biased

I do like the idea of being as up to date as possible, but I sometimes have the suspicion that the Perl community can't keep pace with all the changes anyway. There still isn't one single package that does XSLT 2.0 and XPath 2.0 and so on. Partly we rely on libxml2, which is not going to get an update to the next level.

I managed to get HTML::TreeBuilder::XPath working and am playing around with it at the moment. Getting the right text from the HTML source with XPath is still quite a struggle, frequently resulting in errors... but I am getting to grips with it, and it feels more reliable than running regexes on the source, especially since some parts consist of more than one <p> element. ->findvalues() does a nice trick. I only need to get rid of the nasty cp1252 codes that slipped into an iso-8859-1 encoded HTML file; the € symbol isn't part of iso-8859-1.
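
For the cp1252 problem, here is a minimal sketch of one possible approach, not a tested fix for the actual page. It assumes the stray bytes sit in the 0x80-0x9F range, where cp1252 has printable characters such as € and iso-8859-1 only has control codes, so decoding the whole input as cp1252 should recover them. The sample string in $bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: labelled iso-8859-1, but byte 0x80 is really
# the cp1252 euro sign. Byte 0xE9 (e-acute) is the same in both encodings.
my $bytes = "Price: \x80 10, caf\xE9";

# Decode as cp1252 instead of iso-8859-1. For the bytes that are valid
# printable iso-8859-1 the result is identical; 0x80 becomes U+20AC.
my $text = decode('cp1252', $bytes);

# Emit the decoded characters as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

The decoded string could then be handed to the parser as character data instead of raw bytes.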

I do not want to have a war between the monks, but please enlighten me more on why to use HTML5 instead of TreeBuilder

Re^3: extracting data from HTML
by tobyink (Canon) on Jun 03, 2012 at 19:49 UTC

    "There still isn't one single package that does XSLT 2.0"

    There's XML::Saxon::XSLT2 (again, I'm the developer of it). It's a Perl wrapper around the Java Saxon library, using Inline::Java. It's a bit of a pain to install, and the interface between Java and Perl has a potential to be flaky, but right now it's your only option if you need XSLT 2.0 in Perl.

    I'd love to see some competitors to it spring up, I really would. The only reason I wrote it is because there was literally no other choice in Perl for XSLT 2.0; not out of a love for Java programming. ;-)

    "I do not want to have a war between the monks, but please enlighten me more on why to use HTML5 instead of TreeBuilder"

    Two main reasons:

    • If you want to use XML::LibXML, which as I say is a very good DOM implementation (with XPath, XML Schema, Relax NG, etc) then HTML::HTML5::Parser integrates with it out of the box.

    • It follows the parsing algorithm from the W3C HTML5 working drafts, allowing it to deal with tag soup in much the same way as desktop browsers do. (It currently passes the majority of the html5lib test suite. html5lib is an HTML parsing library for Python and Ruby, and is pretty much the de facto reference implementation of the HTML5 parsing algorithm.) If you wish to deal with random content off the Web, that's kinda important, because there are an awful lot more people who test their content in desktop browsers than test it in HTML::TreeBuilder.

      A practical example. Check out the following piece of HTML in a desktop web browser. Note that (somewhat counter-intuitively) the paragraph containing the emphasised text is rendered above the "Hello World" greeting.

      <table>
        <tr><td>Hello World</td></tr>
        <p>This will be rendered <em>before</em> the greeting.</p>
      </table>

      Now run this test script:

      use 5.010;
      use HTML::TreeBuilder;
      use HTML::HTML5::Parser;

      # Slurp everything after __DATA__ into one string
      my $string = do { local $/; <DATA> };

      say "HTML::HTML5::Parser...";
      say HTML::HTML5::Parser->load_html(string => $string)->textContent;

      say "HTML::TreeBuilder...";
      say HTML::TreeBuilder->new_from_content($string)->as_text;

      __DATA__
      <table>
        <tr><td>Hello World</td></tr>
        <p>This will be rendered <em>before</em> the greeting.</p>
      </table>

      Note that HTML::HTML5::Parser returns the content in the same order as your web browser; HTML::TreeBuilder does not.

    That said, there are plenty of good things about HTML::TreeBuilder too; and if neither of the above apply to you, then it's a good option. It's stable, mature and well-understood by many Perl programmers. I don't really have anything bad to say about it.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'