in reply to extracting data from HTML

Another way to extract data with XPath expressions is HTML::TreeBuilder::XPath. For example, App::scrape uses HTML::Selector::XPath to convert CSS selectors to XPath, then applies those expressions to the tree returned by HTML::TreeBuilder::XPath to extract the payload.
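
As a rough illustration of that pipeline (not App::scrape's actual code), a minimal sketch might look like the following. The sample HTML, the selector, and the variable names are made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical sample page, for illustration only.
my $html = q{<html><head><title>Programming Perl</title></head>
<body><div class="book"><h1 class="title">Programming Perl</h1></div></body></html>};

# Parse the (possibly sloppy) HTML into a tree that understands XPath.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Translate the CSS selector into an XPath expression.
my $xpath = selector_to_xpath('div.book h1.title');

# Apply the XPath to the tree and keep only the text of each match.
my @titles = map { $_->as_text } $tree->findnodes($xpath);
print "$_\n" for @titles;

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```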

Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 14:01 UTC

    Okay, that seems to work; HTML::TreeBuilder seems to be more forgiving.

    However, $tree->dump gives a lot of information; luckily, as_XML_indented looks more readable again.

    Now the next part... extracting the right pieces of information with XPath

    Some pieces will be quite easy, for example the title. Others will come from traversing a <TABLE>:
    in the left column there is a data description, like 'Author', and in the right column the name, like 'Wall, L.' (sometimes inside an <a HREF=...>Author Name</a>, which makes it a bit more complicated, since I only want the text).

    My guess is to look for a text element inside a <td> tag that equals "Author" and then do something with the next sibling?
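
    That guess can be sketched roughly as follows, assuming the page has been parsed with HTML::TreeBuilder::XPath. The table markup and the href value here are invented for the example, and a real page may need a more specific expression.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    # Hypothetical table, for illustration only.
    my $html = q{
    <table>
      <tr><td>Title</td><td>Programming Perl</td></tr>
      <tr><td>Author</td><td><a href="/authors/1">Wall, L.</a></td></tr>
    </table>
    };

    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # Find the cell whose text is "Author", then take the first <td> after it.
    my ($cell) = $tree->findnodes(
        '//td[normalize-space(.)="Author"]/following-sibling::td[1]'
    );

    # as_text drops markup such as <a href=...>, leaving only the text.
    my $author = defined $cell ? $cell->as_text : undef;
    print "$author\n" if defined $author;

    $tree->delete;
    ```

    Because as_text flattens whatever is inside the cell, the same expression should cover both the plain and the linked author name.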

      All your scraping will always be specific to the page(s) you're scraping. Personally, I like to use CSS selectors, as they give results more quickly than fighting with XPath. Whenever CSS is not enough, I fall back to looking at the XPath expressions Firebug suggests for the elements, and work from these.
Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 06, 2012 at 18:55 UTC

    Reading up on HTML::Selector::XPath, I understand that its sole purpose is to translate CSS selectors into XPath expressions. Correct me if I'm wrong.

    However, it doesn't seem to be capable of doing what is needed to solve the problem mentioned in Re^5: extracting data from HTML, where the parser seemed to have provided each and every node with a default namespace.

    Wouldn't it be great if HTML::Selector::XPath had the possibility to give each and every element a user-definable 'default' namespace prefix? - but only those elements that do not have a namespace themselves, of course.

    If you ask me, it can't be too difficult to implement that, can it?

      If I understand you right, the (undocumented) "prefix" option already does that.
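
      A small sketch of what that looks like, assuming the option is passed to to_xpath as a prefix => ... pair. The prefix name 'xhtml' is arbitrary here and has to match whatever prefix is registered with the XPath engine that runs the expression.

      ```perl
      use strict;
      use warnings;
      use HTML::Selector::XPath;

      my $css = 'body p.important';

      # Without a prefix, element names are emitted bare.
      my $plain = HTML::Selector::XPath->new($css)->to_xpath;

      # With a prefix, each element name gets "xhtml:" in front of it.
      my $prefixed = HTML::Selector::XPath->new($css)->to_xpath(prefix => 'xhtml');

      print "$plain\n$prefixed\n";
      ```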

        Indeed - it was me who requested this feature, and supplied the patch. Mostly in order to support XML::LibXML::QuerySelector, which extends XML::LibXML to support CSS selectors...

        my @important_paragraphs = $xmlnode->querySelectorAll('body p.important');

        XML::LibXML::QuerySelector passes the selector on to Corion's module, which returns an XPath expression. It then queries XML::LibXML with that XPath and passes the list of results through a "descendant of" function to make sure that all returned elements are descendants of the original $xmlnode.
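
        The flow described above could be approximated like this. This is only a sketch, not the module's real code: query_selector_all and is_descendant_of are hypothetical helper names, and the XHTML document and the 'xhtml' prefix are invented for the example.

        ```perl
        use strict;
        use warnings;
        use XML::LibXML;
        use HTML::Selector::XPath;

        my $XHTML_NS = 'http://www.w3.org/1999/xhtml';

        # Hypothetical helper: true if $ancestor is somewhere above $node.
        sub is_descendant_of {
            my ($node, $ancestor) = @_;
            for (my $p = $node->parentNode; $p; $p = $p->parentNode) {
                return 1 if $p->isSameNode($ancestor);
            }
            return 0;
        }

        # Hypothetical helper: CSS selector -> XPath -> query -> descendant filter.
        sub query_selector_all {
            my ($context, $css) = @_;
            my $xpath = HTML::Selector::XPath->new($css)->to_xpath(prefix => 'xhtml');

            my $xpc = XML::LibXML::XPathContext->new($context);
            $xpc->registerNs(xhtml => $XHTML_NS);

            # The generated XPath searches the whole document, so filter afterwards.
            return grep { is_descendant_of($_, $context) } $xpc->findnodes($xpath);
        }

        my $doc = XML::LibXML->load_xml(string => q{<html xmlns="http://www.w3.org/1999/xhtml"><body>
          <div id="a"><p class="important">first</p></div>
          <div id="b"><p class="important">second</p></div>
        </body></html>});

        my $xpc = XML::LibXML::XPathContext->new($doc);
        $xpc->registerNs(xhtml => $XHTML_NS);
        my ($div_b) = $xpc->findnodes('//xhtml:div[@id="b"]');

        # Only the paragraph below div "b" should survive the filter.
        my @texts = map { $_->textContent } query_selector_all($div_b, 'p.important');
        print "$_\n" for @texts;
        ```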

        TL;DR: XML::LibXML::QuerySelector implements W3C Selectors API Level 1 for XML::LibXML.

        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'