Re^2: extracting data from HTML

Replies are listed 'Best First'.
Re^3: extracting data from HTML by tobyink (Canon) on Jun 03, 2012 at 20:05 UTC
Yes, it returns plain text, because the `textContent` method is documented as: `this function returns the content of all text nodes in the descendants of the given node as specified in DOM.` (perldoc XML::LibXML::Node) Data::Dumper won't be much use with XML::LibXML: nodes are all just numeric pointers to structures on the other side of the XS boundary (i.e. C structures). There is XML::LibXML::Debugging, which allows, e.g. `print Dumper( $xml->toDebuggingHash );` `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
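To make the difference concrete, here is a minimal sketch contrasting `textContent` with the `toString` method that XML::LibXML::Node also provides. The sample markup and the variable names (`$doc`, `$p`, `$text`, `$markup`) are made up for illustration and do not come from the original script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample markup, standing in for a parsed page.
my $doc = XML::LibXML->load_xml(
    string => '<div><p>Hello <b>world</b></p></div>',
);

# Take the first <p> element in the document.
my ($p) = $doc->findnodes('//p');

# textContent concatenates the text nodes below the element (markup is dropped).
my $text = $p->textContent;

# toString serialises the element itself, tags included.
my $markup = $p->toString;

print "text:   $text\n";
print "markup: $markup\n";
```

If the markup is what you are after, `toString` is the method to reach for; `textContent` is the right tool when only the text matters.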
Re^4: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 05, 2012 at 20:56 UTC
Forgive me, my brethren, but it looks like I bit off more than I could chew, and again I end up with bits I cannot put together. The example of `my $nodes = $xml->findnodes('//*[local-name()="title"]')` wasn't too hard to understand, although I was quite surprised by the construction of the XPath expression; I would have expected something easier, like `my $nodes = $xml->findnodes('//html/head/title')`. But of course, it wouldn't be me if I didn't get it wrong again. With HTML::TreeBuilder::XPath it did work, even things like giving me all the table rows from a specific path and dumping them as text with `my @stuff = $tree->findvalues( '//td[@class="BTrow"]/table/tr/td/table/tr'); print Dumper(\@stuff);` Trying that with HTML::HTML5::Parser only gave undefined results. It looks to me like I'm missing some bit. Please, Toby, and others as well, what am I doing wrong here? It can't be the XPath syntax, can it? Thank you all for your enlightening words and inspiration.
Re^5: extracting data from HTML by tobyink (Canon) on Jun 05, 2012 at 21:59 UTC
The problem with this: `$xml->findnodes('//html/head/title')` ... is that none of the names in the path include a namespace. (X)HTML elements are always namespaced, and an unprefixed name in an XPath expression only matches elements that are in no namespace. Hence my rather awkward... `$xml->findnodes('//*[local-name()="title"]')` i.e. "select all elements where the local part of the element's name is 'title'". Another solution (arguably a lot more readable) would be to forcibly bind the XHTML namespace to a prefix: `$xml->documentElement->setNamespace( 'http://www.w3.org/1999/xhtml' => 'xh', );` And then you can freely use that prefix in XPaths: `$xml->findnodes('//xh:html/xh:head/xh:title')` This specific problem is mentioned in the XML::LibXML::Node documentation; look for the "NOTE ON NAMESPACES AND XPATH" in the documentation for the `findnodes` method. `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
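The namespace behaviour described above can be reproduced with a small self-contained sketch. Note that it swaps in a different technique for binding the prefix: instead of `setNamespace` on the document element, it registers the prefix on an XML::LibXML::XPathContext, which leaves the document untouched. The sample XHTML, the prefix `xh`, and the variable names are assumptions made for illustration; the input is parsed with `load_xml` here rather than HTML::HTML5::Parser.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical XHTML sample; every element is in the XHTML namespace.
my $source = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page</title></head>
  <body><p>Hello</p></body>
</html>
XHTML

my $xml = XML::LibXML->load_xml( string => $source );

# Unprefixed names only match elements in no namespace, so nothing is found.
my @plain = $xml->findnodes('//html/head/title');

# Bind a prefix of our own choosing to the XHTML namespace for XPath lookups.
my $xpc = XML::LibXML::XPathContext->new($xml);
$xpc->registerNs( xh => 'http://www.w3.org/1999/xhtml' );
my @prefixed = $xpc->findnodes('//xh:html/xh:head/xh:title');

# The local-name() form ignores the namespace entirely.
my @local = $xml->findnodes('//*[local-name()="title"]');

printf "plain: %d, prefixed: %d, local-name: %d\n",
    scalar @plain, scalar @prefixed, scalar @local;
print "title: ", $prefixed[0]->textContent, "\n" if @prefixed;
```

The prefix is arbitrary; only the namespace URI has to match the one used in the document.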
Re^6: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 06, 2012 at 15:37 UTC
Re^3: extracting data from HTML by Corion (Patriarch) on Jun 03, 2012 at 19:44 UTC
When you're dealing with XML::LibXML, you'll need to wade through XML::LibXML::Node, from which most of the other classes inherit. Most of them have a `->toString` method if you're interested in their contents.
Re^4: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 03, 2012 at 20:09 UTC
I was working with my own test stuff, and that didn't work. So I used Toby's example, which I assumed would work fine, but unfortunately it did not.
Re^3: extracting data from HTML by Jurassic Monk (Acolyte) on Jun 03, 2012 at 21:01 UTC
"exctracting data from HTML" Oh, how insanely stupid! ARRRGGGHHHH!!!!! All the time I was thinking it was a 'processing indicator' showing that something was being extracted by the HTML5 routine. ARRRRGGGHHHH!!!! /me wonders... do monks curse? "exctracting data from HTML" is indeed the title of that web page, just as it was supposed to be. Now on to the next things to work on... tomorrow.