Update: Thanks to helpful nudges by toj and others, I have completed a script that checks the latest reviews for my school, and sends notifications by email. See Check popular review sites for new reviews.
I'm using HTML::TreeBuilder to parse some Yelp pages for my favorite Mexican restaurant(s). I want to create a list of review dates and star ratings for a specific business.
The problem is that the information I want is contained in span and meta tags, which don't seem to be part of the element tree.
Here is the relevant section of HTML:
    <div class="review-content">
      <div class="biz-rating biz-rating-very-large clearfix">
        <div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
          <div class="rating-very-large">
            <i title="4.0 star rating" class="star-img stars_4">
              <img width="84" height="303" src="http://blah/v2/stars_map.png" class="offscreen" alt="4.0 star rating">
            </i>
            <meta content="4.0" itemprop="ratingValue">
          </div>
        </div>
        <span class="rating-qualifier">
          <meta content="2011-01-13" itemprop="datePublished">
          1/13/2011
        </span>
      </div>
      <p lang="en" itemprop="description" class="review_comment ieSucks">
        blah!!
      </p>
    </div>
And here is the element tree:
    $tree->look_down( 'class', 'review-content' )

    <div class="review-content">
      <div class="biz-rating biz-rating-very-large clearfix">
        <div itemprop="reviewRating" itemscope="itemscope" itemtype="http://schema.org/Rating">
          <div class="rating-very-large">
            <i class="star-img stars_4" title="4.0 star rating">
              <img alt="4.0 star rating" class="offscreen" height="303" src="http://blah/v2/stars_map.png" width="84" />
            </i>
          </div>
        </div>
      </div>
    </div>
So far I have the working program below, which prints the rating, but I haven't been able to access the span that contains the date. Thanks for your help!
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use Data::Dumper;
    use LWP::Simple qw(get);
    use Text::Unidecode qw(unidecode);
    use HTML::TreeBuilder 5 -weak;    # Ensure weak references in use

    # Each entry is [ label, URL of the Yelp business page ]
    my $review_pages = [
        [ 'Jorges #1', 'http://www.yelp.com/biz/jorges-mexicatessen-encinitas' ],
        [ 'Jorges #2', 'http://www.yelp.com/biz/jorges-mexicatessen-encinitas-2' ]
    ];

    for my $page (@$review_pages) {
        my $html = get $page->[1];
        defined $html or die("could not fetch $page->[1]\n");

        # Transliterate any non-ASCII runs to plain ASCII
        $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;

        my $tree = HTML::TreeBuilder->new;    # empty tree
        $tree->parse($html);
        $tree->eof;                           # signal end of input after parse()

        print "Review for $page->[0]\n";

        # One review-content div per review
        my @items = $tree->look_down( 'class', 'review-content' )
            or die("no items\n");

        for my $item (@items) {
            # The star rating is in the title attribute of the <i> element
            my $rating = $item->look_down( '_tag', 'i' )
                or die("no rating\n");
            my $rating_title = $rating->attr('title');
            print "$rating_title\n";
        }
    }
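A minimal sketch of one thing that might be worth trying, under an unconfirmed assumption: if the parser files meta elements somewhere other than where they appear in the source (for example under head), then searching from the root of the tree instead of from each review-content div may still find them. The inline HTML below is a trimmed, hypothetical stand-in for the Yelp markup above, and pairing ratings with dates by position assumes each review carries exactly one of each.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Trimmed stand-in for the review markup shown above (one review only).
my $html = <<'HTML';
<html><body>
<div class="review-content">
  <div class="biz-rating">
    <div itemprop="reviewRating">
      <i title="4.0 star rating" class="star-img stars_4"></i>
      <meta content="4.0" itemprop="ratingValue">
    </div>
    <span class="rating-qualifier">
      <meta content="2011-01-13" itemprop="datePublished">
      1/13/2011
    </span>
  </div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Search the whole tree, not just the review-content subtree,
# in case the meta elements were attached elsewhere during parsing.
my @ratings = map { $_->attr('content') }
    $tree->look_down( _tag => 'meta', itemprop => 'ratingValue' );
my @dates = map { $_->attr('content') }
    $tree->look_down( _tag => 'meta', itemprop => 'datePublished' );

# Assumption: ratings and dates appear in the same order, one pair per review.
for my $i ( 0 .. $#dates ) {
    my $rating = defined $ratings[$i] ? $ratings[$i] : 'n/a';
    print "$dates[$i]\t$rating\n";
}

$tree->delete;    # free the tree explicitly, since -weak is not used here
```

Calling $tree->dump on the parsed tree is a quick way to see where the meta and span elements actually ended up before relying on this.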
In reply to Extracting span and meta content with HTML::TreeBuilder by wrinkles