Re^3: Extracting span and meta content with HTML::TreeBuilder

I guessed that might be the case. How about using XPath?

#!perl
use strict;
use HTML::TreeBuilder::XPath; 
my $tree = HTML::TreeBuilder::XPath->new;  
$tree->parse_file(\*DATA);

my @items = $tree->findnodes( '//div[@class="review-content"]' )
   or die("no items found\n");
for my $item (@items) {
  # './/meta' searches below this div only; a leading '//' starts again at the document root
  for ( $item->findnodes( './/meta') ){
    print $_->attr('itemprop');
    print ' = ';
    print $_->attr('content')."\n";
  }
}

poj


Re^4: Extracting span and meta content with HTML::TreeBuilder by wrinkles (Pilgrim) on Jul 16, 2014 at 22:17 UTC
That gets the metadata, but also far too much that I don't want.
Re^5: Extracting span and meta content with HTML::TreeBuilder by tangent (Parson) on Jul 17, 2014 at 01:54 UTC
poj has shown you how to get the meta properties. To get only the date, add a test at the top of the inner loop: `next unless $_->attr('itemprop') eq 'datePublished';`
Re^5: Extracting span and meta content with HTML::TreeBuilder by poj (Abbot) on Jul 17, 2014 at 12:20 UTC
Ok, try another approach

#!perl
use strict;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('test.htm'); # change to ->parse($html)

my $xpath = '//meta[@itemprop=~/author\|datePublished\|ratingValue/]';
my @items = $tree->findnodes( $xpath )
   or die("no items: $!\n");

my $rec={};
my $count = 0;
for my $item (@items) {
  my $prop = $item->attr('itemprop');
  $rec->{$prop} = $item->attr('content');
  if ($prop eq 'datePublished'){
    print ++$count." ";
    print $rec->{'author'}." ; ";
    print $rec->{'ratingValue'}." ; ";
    print $rec->{'datePublished'}."\n";
  };
}

poj
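The grouping step in the loop above can be tried on its own, without any HTML parsing. This is only a minimal sketch: the `@pairs` list is made-up sample data standing in for the (itemprop, content) values the XPath query would return in document order, and `group_reviews` is a hypothetical helper name, not part of any module. It assumes, as the loop above does, that `datePublished` is the last of the three properties in each review.

```perl
#!perl
use strict;
use warnings;

# Made-up sample data: (itemprop, content) pairs in document order.
my @pairs = (
    [ author        => 'Alice' ],
    [ ratingValue   => '4' ],
    [ datePublished => '2014-07-01' ],
    [ author        => 'Bob' ],
    [ ratingValue   => '5' ],
    [ datePublished => '2014-07-02' ],
);

# Hypothetical helper: remember each property as it arrives and
# emit one line whenever datePublished is seen.
sub group_reviews {
    my @in = @_;
    my %rec;      # most recent value seen for each property
    my @lines;
    for my $pair (@in) {
        my ( $prop, $content ) = @$pair;
        $rec{$prop} = $content;
        next unless $prop eq 'datePublished';
        # Assumption: datePublished closes each review.
        push @lines, join ' ; ',
            map { $rec{$_} // '' } qw(author ratingValue datePublished);
    }
    return @lines;
}

my $count = 0;
print ++$count, " $_\n" for group_reviews(@pairs);
```

Note that the hash is never cleared between reviews, so a review that lacks an author or rating would silently reuse the previous review's value. Emptying the hash after each emitted line would avoid that.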
Re^6: Extracting span and meta content with HTML::TreeBuilder by wrinkles (Pilgrim) on Jul 18, 2014 at 01:41 UTC
poj, yes, that's perfect, thank you! The documentation for the HTML::TreeBuilder::XPath module is not sufficient (at least for me) to understand how your code works. Where can I find documentation that would help me understand it? Did you rely on some key documentation to sort this out? What would you recommend I read?
Re^7: Extracting span and meta content with HTML::TreeBuilder by Anonymous Monk on Jul 18, 2014 at 02:44 UTC