
nrbrtkls,
I am pretty sure that using IMDB::Film is a violation of IMDB's terms of service:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

Additionally, if you view their http://www.imdb.com/robots.txt file, just about everything has been disallowed. Now, I want to give Michael Stepanov the benefit of the doubt and assume that he got permission, but then I question why he used LWP::Simple instead of LWP::RobotUA (the former doesn't respect robots.txt while the latter does).
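
To make the robots.txt point concrete, here is a minimal sketch of the kind of check a well-behaved client performs before requesting a path. Everything in it is an assumption for illustration: path_allowed is a hypothetical helper, the rules are a made-up sample rather than IMDB's actual file, and the parsing is deliberately simplified (only Disallow lines in the "User-agent: *" group, plain prefix matching). For real use, a CPAN module such as WWW::RobotRules or LWP::RobotUA is the better choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $path is not covered by any Disallow rule in
# the "User-agent: *" group of $robots_txt. Simplified on purpose: no Allow
# lines, no wildcards, no agent-specific groups.
sub path_allowed {
    my ($robots_txt, $path) = @_;
    my $in_star_group = 0;
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/#.*//;                 # strip trailing comments
        next if $line !~ /\S/;            # skip blank lines
        if ($line =~ /^\s*User-agent:\s*(\S+)/i) {
            $in_star_group = $1 eq '*';
        }
        elsif ($in_star_group && $line =~ /^\s*Disallow:\s*(\S+)/i) {
            return 0 if index($path, $1) == 0;   # rule is a prefix of the path
        }
    }
    return 1;
}

# Made-up rules for illustration only.
my $sample = <<'END';
User-agent: *
Disallow: /title/
Disallow: /find
END

for my $path ('/title/tt0442933/', '/help/') {
    print "$path: ", (path_allowed($sample, $path) ? "allowed" : "disallowed"), "\n";
}
```

With these sample rules the title path is reported as disallowed and the help path as allowed, which is the answer a robots-aware agent would act on before fetching.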

It also seems pretty obvious to me that IMDB does not want people scraping their recommendations (potentially to reverse engineer the algorithm they developed). Read on for why I came to this conclusion, which I admit is a pure guess.

Assuming I am wrong about the TOS, I recommend you open a bug report. I checked the RT queue but did not see this particular issue. Since it seemed like an interesting challenge, I set out to solve the problem by using the "view source" feature of Firefox to save local copies of a handful of pages. The first thing I noticed is that the recommendations seen on the page are not in the source. Well, of course they are, but not in the straightforward way you might think. The second thing I noticed is that if you click on "See more Recommendations", the resulting page does not also list the original ones.

Please do not run the following code in violation of the TOS. As I said above, I developed it using a handful of pages downloaded from Firefox's "view source" to local files. This is also terribly ugly and prone to much breakage - I just wanted to see how to do it. I have emailed the author a pointer to this thread.

#!/usr/bin/perl
use strict;
use warnings;
use IMDB::Film;
use LWP::Simple 'get';
       
my $imdb = IMDB::Film->new(crit => '0442933');
die "Something went wrong: " . $imdb->error . "\n" if ! $imdb->status;

for my $info (qw/title year plot rating/) {
    print ucfirst($info), ": ", scalar $imdb->$info, "\n";
}
print "Recommendations:\n";
my $recs = fetch_recommendations($imdb);
while (my ($id, $title) = each %$recs) {
    print "$id: $title\n";
}

sub fetch_recommendations {
    my ($imdb) = @_;
    my $url = 'http://www.imdb.com/title/tt' . $imdb->id . '/recommendations';
    my $content = get($url) || '';
    my ($extract) = $content =~ /by the database(.*?)if you want to see if a movie /s;
    $extract = '' if ! defined $extract;
    my %rec;
    while ($extract =~ m|href="/title/tt(\d+)/">([^<]+)|g) {
        my ($id, $title) = ($1, $2);
        $rec{$id} = $title;
    }
    return \%rec;
}
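
Since the code was developed against locally saved pages, the extraction step can be tried without touching the site at all. The following is a rough sketch under stated assumptions: extract_recommendations is a hypothetical name, it reuses the same two regexes as fetch_recommendations, and the HTML fragment is made up to match what those regexes expect (it is not real IMDB markup).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same regexes as fetch_recommendations, but it takes
# the page source as a string, so nothing is fetched over the network.
sub extract_recommendations {
    my ($content) = @_;
    my ($extract) = $content =~ /by the database(.*?)if you want to see if a movie /s;
    return {} if !defined $extract;
    my %rec;
    while ($extract =~ m|href="/title/tt(\d+)/">([^<]+)|g) {
        $rec{$1} = $2;    # id => title
    }
    return \%rec;
}

# Made-up fragment for illustration; the last link sits outside the block.
my $sample = <<'END';
<p>Suggested by the database</p>
<a href="/title/tt0000001/">First Example</a>
<a href="/title/tt0000002/">Second Example</a>
<p>if you want to see if a movie is right for you</p>
<a href="/title/tt0000003/">Outside The Block</a>
END

my $recs = extract_recommendations($sample);
print "$_: $recs->{$_}\n" for sort keys %$recs;
```

Only the two links between the marker phrases are returned. To use it on a saved page, read the file into a string and pass that in place of $sample.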

Cheers - L~R

In reply to Re: help a Dutchman with hash by Limbic~Region
in thread help a Dutchman with hash by nrbrtkls
