Re: extracting data from HTML

Re^2: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 03, 2012 at 14:01 UTC
Okay, that seems to work: `HTML::TreeBuilder` seems to be more forgiving. However, `$tree->dump` gives a lot of information; luckily, `as_XML_indented` looks more readable again. Now for the next part: extracting the right pieces of information with XPath. Some pieces will be quite easy, for example the title. Others will come from traversing a <TABLE>: the left column holds a data description, like 'Author', and the right column holds the name, like 'Wall, L.' (sometimes inside an `<a HREF=...>Author Name</a>`, which makes it a bit more complicated, since I only want the text). My guess is to look for a text element in a <td> tag that equals "Author" and then do something with the next sibling?
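A minimal sketch of that "find the label cell, then take the next sibling" idea, assuming the page is parsed with HTML::TreeBuilder::XPath. The sample markup, the book title, and the variable names are made up for illustration; the real page will need its own expressions. Asking for the string value of the sibling cell returns only its text, so a wrapping `<a>` does not get in the way.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page: a label/value table where one value sits inside a link.
my $html = <<'HTML';
<html><head><title>Programming Perl</title></head><body>
<table>
<tr><td>Author</td><td><a href="/authors/wall">Wall, L.</a></td></tr>
<tr><td>Publisher</td><td>O'Reilly</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The easy case: the page title.
my $title = $tree->findvalue('//title');

# Find the <td> whose text is "Author", then take the first <td> that follows it
# in the same row. findvalue returns the text content, so the <a> tag is dropped.
my $author = $tree->findvalue(
    '//td[normalize-space(.)="Author"]/following-sibling::td[1]'
);

# Trim stray whitespace around the extracted values.
s/^\s+|\s+$//g for $title, $author;

print "Title:  $title\n";
print "Author: $author\n";

$tree->delete;    # free the parse tree
```

The same pattern should work for other rows by swapping the label in the predicate, for example "Publisher" in this sample.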
Re^3: extracting data from HTML by Corion (Patriarch) on Jun 03, 2012 at 16:41 UTC
All your scraping will always be specific to the page(s) you're scraping. Personally, I like to use CSS selectors, as they give results more quickly than fighting with XPath. Whenever CSS is not enough, I fall back to looking at the XPath expressions Firebug suggests for the elements, and work from these.
Re^2: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 06, 2012 at 18:55 UTC
Reading up on HTML::Selector::XPath, I understand that its sole purpose is to translate CSS selectors into XPath expressions. Correct me if I'm wrong. However, it doesn't seem to be capable of doing what is needed to solve the problem mentioned in Re^5: extracting data from HTML, where the parser seemed to have provided each and every node with a default namespace. Wouldn't it be great if HTML::Selector::XPath could give each and every element a user-definable 'default' namespace prefix, but only those elements that do not have a namespace themselves, of course? If you ask me, that can't be too difficult to implement, can it?
Re^3: extracting data from HTML by Corion (Patriarch) on Jun 06, 2012 at 21:04 UTC
If I understand you right, the (undocumented) "prefix" option already does that.
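A rough sketch of what that could look like, assuming the option is spelled `prefix` and is passed as a named argument to `selector_to_xpath`. Since the option is undocumented, that spelling and the `xhtml` prefix used here are assumptions, so check the module source for your installed version before relying on it.

```perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

my $selector = 'body p.important';

# Plain translation from a CSS selector to an XPath expression.
my $plain = selector_to_xpath($selector);

# Same selector, asking for element names to carry a namespace prefix.
# ASSUMPTION: "prefix" is the undocumented option mentioned above, and
# "xhtml" is just an example prefix that the XPath engine must have
# bound to the document's default namespace.
my $with_prefix = selector_to_xpath($selector, prefix => 'xhtml');

print "plain:    $plain\n";
print "prefixed: $with_prefix\n";
```

The prefix by itself does nothing until it is registered with the XPath engine in use, for example through an XPath context in XML::LibXML.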
Re^4: extracting data from HTML by tobyink (Canon) on Jun 06, 2012 at 22:17 UTC
Indeed - it was me who requested this feature, and supplied the patch. Mostly in order to support XML::LibXML::QuerySelector, which extends XML::LibXML to support CSS selectors... `my @important_paragraphs = $xmlnode->querySelectorAll('body p.important');` XML::LibXML::QuerySelector passes the selector on to Corion's module, which returns an XPath expression. It then queries XML::LibXML with that XPath, and passes the list of results through a "descendant of" function to make sure that all returned elements are descendants of the original `$xmlnode`. TL;DR: XML::LibXML::QuerySelector implements W3C Selectors API Level 1 for XML::LibXML. `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`