in reply to Re^3: Extracting span and meta content with HTML::TreeBuilder

That gets the metadata, but also far too much that I don't want.

Re^5: Extracting span and meta content with HTML::TreeBuilder
by tangent (Parson) on Jul 17, 2014 at 01:54 UTC
    poj has shown you how to get the meta properties - to get the date just add a test:
    next unless $_->attr('itemprop') eq 'datePublished';
Re^5: Extracting span and meta content with HTML::TreeBuilder
by poj (Abbot) on Jul 17, 2014 at 12:20 UTC
    Ok, try another approach
    #!perl
    use strict;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file('test.htm');    # change to ->parse($html)

    # select only the <meta> elements whose itemprop matches one of these names
    my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
    my @items = $tree->findnodes( $xpath )
        or die("no items: $!\n");

    my $rec   = {};
    my $count = 0;
    for my $item (@items) {
        my $prop = $item->attr('itemprop');
        $rec->{$prop} = $item->attr('content');

        # print the collected record each time a datePublished turns up
        if ($prop eq 'datePublished') {
            print ++$count." ";
            print $rec->{'author'}." ; ";
            print $rec->{'ratingValue'}." ; ";
            print $rec->{'datePublished'}."\n";
        }
    }
    poj
      poj,

      Yes, that's perfect, thank you!

      The docs for the HTML::TB::XP module are not sufficient (at least for me) to understand how your code works. Where is the documentation that would help me understand this? Did you use some key documentation to sort this out? What do you recommend I read to understand it?

        See all the links in Re: Retrieve select information from HTML. They're examples (for tree-xpath and others), walkthroughs and tutorials. Tools like xpather.pl/htmltreexpather.pl can give you paths to start with.

        findnodes gives you nodes. In the case of treebuilder these are HTML::Element objects that you can call methods on. The other player, XML::LibXML, gives you XML::LibXML::Node objects, be they XML::LibXML::Element or something else (libxml follows the DOM closely).
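        To make that concrete, here is a minimal sketch of calling HTML::Element methods on what findnodes returns. The markup and its values (alice, 4.0, 2011-01-13) are made up for illustration and are not taken from the original file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, only loosely modelled on the page discussed in this thread.
my $html = <<'HTML';
<html><head><title>demo</title></head>
<body>
<div class="review-content">
  <meta itemprop="author" content="alice">
  <meta itemprop="ratingValue" content="4.0">
  <meta itemprop="datePublished" content="2011-01-13">
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my %rec;
my $all_elements = 1;
for my $node ( $tree->findnodes('//meta[@itemprop]') ) {
    # each node is an HTML::Element, so the usual methods are available
    $all_elements &&= $node->isa('HTML::Element');
    $rec{ $node->attr('itemprop') } = $node->attr('content');
}

print "$_ = $rec{$_}\n" for sort keys %rec;

$tree->delete;    # free the tree once the values have been copied out
```

        The predicate [@itemprop] keeps only the meta elements that carry that attribute; everything after that is plain HTML::Element method calls (attr, tag, as_text and so on).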

        This tutorial needs JavaScript: http://zvon.org/comp/r/tut-XPath_1.html

        On the file you provided, xpather spits out stuff like this:

        /html/body/div/div/span
        # posy
        /html[1]/body[1]/div[1]/div[1]/span[1]
        # star
        /*[ local-name() = "html" and position() = 1 ]
        /*[ local-name() = "body" and position() = 1 ]
        /*[ local-name() = "div" and position() = 1 and @class = "review-content" ]
        /*[ local-name() = "div" and position() = 1 and @class = "biz-rating biz-rating-very-large clearfix" ]
        /*[ local-name() = "span" and @class = "rating-qualifier" and contains(string(), " 1/13/2011 ") ]
        # rats
        /html[1]
        /body[1]
        /*[ name() = "div" and position() = 1 and @class = "review-content" ]
        /*[ name() = "div" and position() = 1 and @class = "biz-rating biz-rating-very-large clearfix" ]
        /*[ name() = "span" and position() = 1 and @class = "rating-qualifier" ]

        It's a tree :) so  //meta means find a  <meta> anywhere, whereas  /foo/meta means find every <meta> that is a child of the root element foo, as in <foo><meta></meta>....</foo>
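        A small sketch of that difference, using span instead of meta so the result does not depend on where the parser files head-type elements. The markup and the two dates are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: one span directly under the outer div, one nested deeper.
my $html = <<'HTML';
<html><body>
<div class="review-content">
  <span class="rating-qualifier">1/13/2011</span>
  <div class="biz-rating"><span class="rating-qualifier">2/2/2012</span></div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# //span: every span, at any depth
my @anywhere = $tree->findnodes('//span');

# /html/body/div/span: only spans that are children of a div under body
my @direct = $tree->findnodes('/html/body/div/span');

# one level further down
my @nested = $tree->findnodes('/html/body/div/div/span');

my $direct_text = $direct[0]->as_text;
my $nested_text = $nested[0]->as_text;

print "anywhere: ", scalar(@anywhere), "\n";
print "direct:   $direct_text\n";
print "nested:   $nested_text\n";

$tree->delete;
```

        The absolute paths only match when every step from the root is spelled out, which is why the short //name form is usually the easier starting point.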

        The examples/tuts give better examples and explanations.