too bad...

I had hoped for a bit more exotic result from that HTML::HTML5::Parser. All I got was:

exctracting data from HTML

Calling Dumper( $xml ) from Data::Dumper did not give a nice result either:

$VAR1 = bless( do{\(my $o = 21921056)}, 'XML::LibXML::Document' );

time to do some more meditation

Re^3: extracting data from HTML
by tobyink (Canon) on Jun 03, 2012 at 20:05 UTC

    Yes, it returns plain text because the textContent method is documented as:

    this function returns the content of all text nodes in the descendants of the given node as specified in DOM. (perldoc XML::LibXML::Node)

    Data::Dumper won't be much use with XML::LibXML. Nodes are all just numeric pointers to structures at the other side of the XS boundary (i.e. C structures). There is XML::LibXML::Debugging which allows, e.g.

    print Dumper( $xml->toDebuggingHash );
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Forgive me, my brethren, but it looks like I bit off more than I could chew, and again I end up with bits I cannot put together.

      The example of
      my $nodes = $xml->findnodes('//*[local-name()="title"]')
      wasn't too hard to understand, although I was quite surprised by the construction of the XPath expression; I would have expected something simpler, like
      my $nodes = $xml->findnodes('//html/head/title')
      But of course, it wouldn't be me if I didn't get it wrong again.

      With HTML::TreeBuilder::XPath it did work, even things like fetching all the table rows under a specific path and dumping them as text with

      my @stuff = $tree->findvalues('//td[@class="BTrow"]/table/tr/td/table/tr');
      print Dumper(\@stuff);

      Trying that with HTML::HTML5::Parser only gave me undefined results.

      It looks to me like I'm missing something.

      Please, Toby and others as well, what am I doing wrong here? It can't be the XPath syntax, can it?

      Thank you all for your enlightening words and inspiration.

        The problem with this:

        $xml->findnodes('//html/head/title')

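        ... is that none of the names in the path include a namespace, and in XPath an unprefixed name only matches elements that are in no namespace. (X)HTML elements are always namespaced, so the path matches nothing. Hence my rather awkward...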

        $xml->findnodes('//*[local-name()="title"]')

        i.e. "select all elements where the local part of the element's name is 'title'"
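
        To see the difference side by side, here is a minimal sketch. It assumes plain XML::LibXML and a small hand-written XHTML string standing in for the parser's output (the variable names are made up for illustration):

```perl
use strict;
use warnings;
use XML::LibXML;

# A tiny namespaced document, standing in for what the HTML5 parser returns.
my $xhtml = <<'XML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>exctracting data from HTML</title></head>
  <body><p>hello</p></body>
</html>
XML

my $doc = XML::LibXML->load_xml( string => $xhtml );

# Unprefixed names only match elements in no namespace, so this finds nothing.
my $plain = $doc->findnodes('//html/head/title');

# Matching on the local part of the name ignores the namespace.
my $local = $doc->findnodes('//*[local-name()="title"]');

printf "plain path: %d node(s)\n", $plain->size;
printf "local-name: %d node(s)\n", $local->size;
print  "title text: ", $local->get_node(1)->textContent, "\n";
```

        In scalar context findnodes returns an XML::LibXML::NodeList, which is why size and get_node (counting from 1) are available here.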

        Another solution (arguably a lot more readable) would be to forcibly bind the XHTML namespace to a prefix:

        $xml->documentElement->setNamespace( 'http://www.w3.org/1999/xhtml' => 'xh', );

        And then you can freely use that prefix in XPaths.

        $xml->findnodes('//xh:html/xh:head/xh:title')

        This specific problem is mentioned in the XML::LibXML::Node documentation - look for the "NOTE ON NAMESPACES AND XPATH" in the documentation for the findnodes method.
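
        That same note recommends XML::LibXML::XPathContext, which binds the prefix for XPath purposes only and leaves the document untouched. A minimal sketch, assuming a small hand-written XHTML string in place of the real page (the prefix "xh" is an arbitrary choice):

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Stand-in for the parsed page; the real document would come from the parser.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>exctracting data from HTML</title></head>
  <body/>
</html>
XML

# Register a prefix for the XHTML namespace on the XPath context,
# not on the document itself.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( xh => 'http://www.w3.org/1999/xhtml' );

# The prefix can now be used in any expression evaluated through $xpc.
my $title = $xpc->findvalue('/xh:html/xh:head/xh:title');
print "$title\n";
```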

Re^3: extracting data from HTML
by Corion (Patriarch) on Jun 03, 2012 at 19:44 UTC

    When you're dealing with XML::LibXML, you'll need to wade through XML::LibXML::Node, from which most of the other classes inherit. Most of them have a ->toString method if you're interested in their contents.
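
    As a small illustration of the difference between a node's markup and its text, here is a sketch assuming a made-up two-element document (the element and attribute names are hypothetical):

```perl
use strict;
use warnings;
use XML::LibXML;

# A made-up document, just to have a node to inspect.
my $doc = XML::LibXML->load_xml(
    string => '<root><item id="1">first</item></root>'
);
my ($item) = $doc->findnodes('/root/item');

my $markup = $item->toString;     # the element serialised back to markup
my $text   = $item->textContent;  # only the text inside it

print "$markup\n";
print "$text\n";
```

    Either of these is usually more informative than Data::Dumper, which only sees the blessed pointer.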

      I was working with my own test stuff, and that didn't work. So I used Toby's example, which I assumed would work fine, but unfortunately it did not.

Re^3: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 21:01 UTC
    "exctracting data from HTML"

    Oh how insanely stupid! ARRRGGGHHHH!!!!!#$#@@#$%&^%

    All the time I was thinking it was a 'processing indicator' that something was being extracted by the HTML5 routine. ARRRRGGGHHHH!!!!

    /me wonders... do monks curse?

    "exctracting data from HTML" is indeed the title of that web page, just as it was supposed to be.

    Now on to the next things to work on... tomorrow.