Update: Thanks to helpful nudges from toj and others, I have completed a script that checks the latest reviews for my school and sends notifications by email. See "Check popular review sites for new reviews".

I'm using HTML::TreeBuilder to parse some Yelp pages for my favorite Mexican restaurant(s). I want to create a list of review dates and star ratings for a specific business.

The problem is that the information I want is contained in span and meta tags, which don't seem to be part of the element tree.

Here is the relevant section of HTML:

<div class="review-content">
  <div class="biz-rating biz-rating-very-large clearfix">
    <div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
      <div class="rating-very-large">
        <i title="4.0 star rating" class="star-img stars_4">
            <img width="84" height="303" src="http://blah/v2/stars_map.png" class="offscreen" alt="4.0 star rating">
        </i>
        <meta content="4.0" itemprop="ratingValue">
      </div>
    </div>
    <span class="rating-qualifier">
      <meta content="2011-01-13" itemprop="datePublished">
        1/13/2011
    </span>
  </div>
  <p lang="en" itemprop="description" class="review_comment ieSucks">
    blah!!
  </p> 
</div>

And here is the element tree:

$tree->look_down( 'class', 'review-content' )

<div class="review-content">
  <div class="biz-rating biz-rating-very-large clearfix">
    <div itemprop="reviewRating" itemscope="itemscope" itemtype="http://schema.org/Rating">
      <div class="rating-very-large">
        <i class="star-img stars_4" title="4.0 star rating">
          <img alt="4.0 star rating" class="offscreen" height="303" src="http://blah/v2/stars_map.png" width="84" />
        </i>
      </div>
    </div>
  </div>
</div>

So far I have the working program below, which prints the rating, but I haven't been able to access the span that contains the date. Thanks for your help!

#!/usr/bin/env perl 

use strict;
use warnings;
use utf8;
use Data::Dumper;

use LWP::Simple qw(get);
use Text::Unidecode qw(unidecode);
use HTML::TreeBuilder 5 -weak;    # Ensure weak references in use

my $review_pages = [
    [
        'Jorges #1',
        'http://www.yelp.com/biz/jorges-mexicatessen-encinitas'
    ],
    [
        'Jorges #2',
        'http://www.yelp.com/biz/jorges-mexicatessen-encinitas-2'
    ]
];

for my $page (@$review_pages) {
    my ( $name, $url ) = @$page;
    my $html = get($url);
    defined $html or die("could not fetch $url\n");
    $html =~ s/([^[:ascii:]]+)/unidecode($1)/ge;    # transliterate non-ASCII
    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->parse($html);
    $tree->eof;                           # signal end of input to the parser
    print "Review for $name\n";
    my @items = $tree->look_down( 'class', 'review-content' )
      or die("no items found for $name\n");
    for my $item (@items) {
        # In scalar context, look_down returns the first matching element.
        my $rating = $item->look_down( '_tag', 'i' )
          or die("no rating found\n");
        my $rating_title = $rating->attr('title');
        print "$rating_title\n";
    }
}
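
For illustration only, here is a minimal sketch of one way the date and rating might be read from the meta tags, assuming they survive parsing somewhere in the tree. It searches from the root rather than from the review-content div, so it does not depend on where the parser places the meta elements. The helper names meta_contents and extract_reviews and the trimmed sample markup are hypothetical, and this has not been checked against Yelp's live pages.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: the "content" attribute of every
# <meta itemprop="..."> element found under $tree, in document order.
sub meta_contents {
    my ( $tree, $itemprop ) = @_;
    return map { $_->attr('content') }
      $tree->look_down( _tag => 'meta', itemprop => $itemprop );
}

# Hypothetical helper: returns a list of [ date, rating ] pairs.
sub extract_reviews {
    my ($html) = @_;

    # new_from_content parses the string and calls eof for us.
    my $tree    = HTML::TreeBuilder->new_from_content($html);
    my @dates   = meta_contents( $tree, 'datePublished' );
    my @ratings = meta_contents( $tree, 'ratingValue' );
    $tree->delete;    # free the tree explicitly

    # Assumption: dates and ratings occur once per review and in the
    # same order, so they can be paired by position.
    my $n = @dates < @ratings ? @dates : @ratings;
    return map { [ $dates[$_], $ratings[$_] ] } 0 .. $n - 1;
}

# Trimmed copy of the markup quoted in the question.
my $sample = <<'HTML';
<div class="review-content">
  <div itemtype="http://schema.org/Rating" itemscope="" itemprop="reviewRating">
    <i title="4.0 star rating" class="star-img stars_4"></i>
    <meta content="4.0" itemprop="ratingValue">
  </div>
  <span class="rating-qualifier">
    <meta content="2011-01-13" itemprop="datePublished">
    1/13/2011
  </span>
</div>
HTML

printf "%s  %s stars\n", @$_ for extract_reviews($sample);
```

Pairing by position is fragile: a real page may carry other ratingValue metas (for example an aggregate rating for the business), which would shift the pairs, so a real script would probably need to filter more carefully.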

In reply to Extracting span and meta content with HTML::TreeBuilder by wrinkles
