in reply to Re^2: Extracting span and meta content with HTML::TreeBuilder
in thread Extracting span and meta content with HTML::TreeBuilder

I guessed that might be the case. How about using XPath?
#!perl
use strict;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file(\*DATA);

my @items = $tree->findnodes( '//div[@class="review-content"]' )
  or die("no items: $!\n");

for my $item (@items) {
  for ( $item->findnodes( '//meta') ){
    print $_->attr('itemprop');
    print ' = ';
    print $_->attr('content')."\n";
  }
}
poj

Replies are listed 'Best First'.
Re^4: Extracting span and meta content with HTML::TreeBuilder
by wrinkles (Pilgrim) on Jul 16, 2014 at 22:17 UTC
    That's getting the meta data, but also way too much of what I don't want.
      poj has shown you how to get the meta properties - to get the date just add a test:
      next unless $_->attr('itemprop') eq 'datePublished';
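      For context, here is a rough sketch of where that test might sit in the loop. The sample HTML and the published_dates helper are made up for illustration, and the defined() guard is an addition for meta tags that carry no itemprop attribute.

```perl
#!perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, invented for this sketch.
my $html = <<'END_HTML';
<html><head>
<meta itemprop="author" content="Alice">
<meta itemprop="datePublished" content="2014-07-01">
<meta itemprop="ratingValue" content="4">
</head><body><p>review text</p></body></html>
END_HTML

# published_dates() is an illustrative helper, not part of any module.
sub published_dates {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($source);
    $tree->eof;    # tell the parser the document is complete

    my @dates;
    for my $meta ( $tree->findnodes('//meta') ) {
        my $prop = $meta->attr('itemprop');
        # the added test, guarded against meta tags without itemprop
        next unless defined $prop && $prop eq 'datePublished';
        push @dates, $meta->attr('content');
    }
    $tree->delete;    # free the tree
    return @dates;
}

print "datePublished = $_\n" for published_dates($html);
```

      Every other meta element is skipped by the next, so only the date lines are printed.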
      Ok, try another approach
      #!perl
      use strict;
      use HTML::TreeBuilder::XPath;

      my $tree = HTML::TreeBuilder::XPath->new;
      $tree->parse_file('test.htm'); # change to ->parse($html)

      my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
      my @items = $tree->findnodes( $xpath )
        or die("no items: $!\n");

      my $rec = {};
      my $count = 0;
      for my $item (@items) {
        my $prop = $item->attr('itemprop');
        $rec->{$prop} = $item->attr('content');
        if ($prop eq 'datePublished'){
          print ++$count." ";
          print $rec->{'author'}." ; ";
          print $rec->{'ratingValue'}." ; ";
          print $rec->{'datePublished'}."\n";
        }
      }
      poj
        poj,

        Yes, that's perfect, thank you!

        The docs for the HTML::TreeBuilder::XPath module are not sufficient (at least for me) to understand how your code works. Where is the documentation that would help me understand this? Did you rely on some key documentation to sort this out? What would you recommend I read?