in reply to Re: Extracting span and meta content with HTML::TreeBuilder
in thread Extracting span and meta content with HTML::TreeBuilder

Thanks poj, I first have to extract the "review-content" elements, and pull the span out of those. So I don't have that HTML snippet to work on directly. A nested look_down fails:
for my $page (@$review_pages) {
    my $html = get $page->[1];
    $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;
    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->parse($html);
    print "Review for $page->[0]\n";
    my @items = $tree->look_down( 'class', 'review-content' )
        or die("no items: $!\n");
    for my $item (@items) {
        my @meta = $item->look_down( '_tag', 'meta' )
            or die("no meta: $!\n");    # dies here
        for my $meta_item (@meta) {
            print $meta_item->attr('itemprop');
            print ' = ';
            print $meta_item->attr('content') . "\n";
        }
    }
}
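One way to narrow this down is to compare how many meta elements sit under a "review-content" element with how many exist in the whole parsed tree, and to dump the subtree to see what the parser actually built. The following is only a diagnostic sketch: the HTML string is a made-up stand-in for the real page, and the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the real page is not shown in this thread.
my $html = '<html><body><div class="review-content">'
         . '<span>Nice place</span>'
         . '<meta itemprop="datePublished" content="2014-07-16">'
         . '</div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# First element whose class attribute is exactly "review-content".
my ($item) = $tree->look_down( class => 'review-content' );

# Count meta elements below that element and in the tree as a whole.
my @inside = $item->look_down( _tag => 'meta' );
my @total  = $tree->look_down( _tag => 'meta' );
print "meta under review-content: ", scalar(@inside), "\n";
print "meta in whole tree:        ", scalar(@total),  "\n";

# Print the subtree as the parser built it.
$item->dump;
```

If the two counts differ, the meta elements are not where the nested look_down expects them, and a search over the whole tree may be needed instead.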

Re^3: Extracting span and meta content with HTML::TreeBuilder
by poj (Abbot) on Jul 16, 2014 at 21:35 UTC
    I guessed that might be the case. How about using XPath?
    #!perl
    use strict;
    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file(\*DATA);
    my @items = $tree->findnodes( '//div[@class="review-content"]' )
        or die("no items: $!\n");
    for my $item (@items) {
        for ( $item->findnodes( '//meta') ){
            print $_->attr('itemprop');
            print ' = ';
            print $_->attr('content')."\n";
        }
    }
    poj
      That's getting the meta data, but also way too much of what I don't want.
        poj has shown you how to get the meta properties - to get the date just add a test:
        next unless $_->attr('itemprop') eq 'datePublished';
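A small sketch of that test in isolation. It uses plain hashes as hypothetical stand-ins for what `attr('itemprop')` and `attr('content')` would return, since `attr` returns undef when a meta element has no itemprop attribute. Under `use warnings`, comparing undef with `eq` produces a warning, so this version defaults the value to an empty string first.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for parsed <meta> elements; undef models
# a <meta> that has no itemprop attribute at all.
my @meta = (
    { itemprop => undef,           content => 'text/html'  },
    { itemprop => 'author',        content => 'Alice'      },
    { itemprop => 'datePublished', content => '2014-07-16' },
);

my @dates;
for my $m (@meta) {
    # Treat a missing itemprop as '' so that eq does not warn.
    next unless ( $m->{itemprop} // '' ) eq 'datePublished';
    push @dates, $m->{content};
}

print "$_\n" for @dates;
```

With real elements the same guard would read `( $_->attr('itemprop') // '' ) eq 'datePublished'`.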
        OK, try another approach:
        #!perl
        use strict;
        use HTML::TreeBuilder::XPath;
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse_file('test.htm');    # change to ->parse($html)
        my $xpath = '//meta[@itemprop=~/author|datePublished|ratingValue/]';
        my @items = $tree->findnodes( $xpath )
            or die("no items: $!\n");
        my $rec = {};
        my $count = 0;
        for my $item (@items) {
            my $prop = $item->attr('itemprop');
            $rec->{$prop} = $item->attr('content');
            if ($prop eq 'datePublished'){
                print ++$count." ";
                print $rec->{'author'}." ; ";
                print $rec->{'ratingValue'}." ; ";
                print $rec->{'datePublished'}."\n";
            }
        }
        poj
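The loop above relies on datePublished being the last of the three properties in each review, and it reuses one hash for every review, so a review without a rating would print the previous review's value. Below is a sketch of the same accumulate-and-flush idea with a reset after each record. It runs on hypothetical (itemprop, content) pairs rather than parsed elements, and the 'n/a' placeholder is an assumption of this sketch, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical pairs in document order; the second review has no rating.
my @pairs = (
    [ author        => 'Alice' ],
    [ ratingValue   => '4' ],
    [ datePublished => '2014-07-01' ],
    [ author        => 'Bob' ],
    [ datePublished => '2014-07-02' ],
);

my ( %rec, @rows );
for my $pair (@pairs) {
    my ( $prop, $content ) = @$pair;
    $rec{$prop} = $content;

    # datePublished is assumed to close a review, as in the loop above.
    if ( $prop eq 'datePublished' ) {
        push @rows, join ' ; ',
            map { $rec{$_} // 'n/a' } qw(author ratingValue datePublished);
        %rec = ();    # start a fresh record so old values cannot leak
    }
}

print "$_\n" for @rows;
```

If the properties can appear in a different order on the real page, the flush condition would need to change accordingly.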