Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
Additionally, if you view their http://www.imdb.com/robots.txt file, just about everything has been disallowed. Now, I want to give Michael Stepanov the benefit of the doubt and assume that he got permission, but then I question why he used LWP::Simple instead of a robot-aware agent such as LWP::RobotUA (the former doesn't respect robots.txt while the latter does).
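For anyone wanting to check robots.txt rules from Perl before fetching anything, here is a minimal sketch using WWW::RobotRules. The agent name, the example.com URLs, and the robots.txt content are all made up for illustration; they are not IMDB's actual rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical agent name, for illustration only.
my $rules = WWW::RobotRules->new('ExampleBot/0.1');

# Hypothetical robots.txt content, NOT copied from any real site.
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /title/
END_ROBOTS

# parse() takes the URL the robots.txt came from, plus its content.
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

for my $url (
    'http://www.example.com/title/tt0000000/recommendations',
    'http://www.example.com/help/',
) {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url: $verdict\n";
}
```

LWP::RobotUA builds the same kind of check into a user agent that fetches and consults robots.txt itself, which is usually less work than doing it by hand.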
It also seems pretty obvious to me that IMDB does not want people scraping their recommendations (potentially to reverse engineer the algorithm they developed). Read below for how I came to this conclusion, which I admit is a pure guess.
Assuming I am wrong about the TOS, I recommend you open a bug report. I checked the RT queue but did not see this particular one.
Since it seemed like an interesting challenge, I set out to solve the problem by using the "view source" feature of Firefox and saving local copies of a handful of pages. The first thing I noticed is that the recommendations seen on the page are not in the source. Well, of course they are, but not in the straightforward way you would think. The second thing I noticed is that if you click on the "See more Recommendations" link, the original ones are not also listed.
Please do not run the following code in violation of the TOS. As I said above, I developed it using a handful of pages downloaded from Firefox's "view source" to local files. This is also terribly ugly and prone to much breakage - I just wanted to see how to do it. I have emailed the author a pointer to this thread.
#!/usr/bin/perl
use strict;
use warnings;
use IMDB::Film;
use LWP::Simple 'get';

my $imdb = new IMDB::Film(crit => '0442933');
die "Something went wrong: " . $imdb->error . "\n" if ! $imdb->status;

for my $info (qw/title year plot rating/) {
    print ucfirst($info), ": ", scalar $imdb->$info, "\n";
}

print "Recommendations:\n";
my $recs = fetch_recommendations($imdb);
while (my ($id, $title) = each %$recs) {
    print "$id: $title\n";
}

sub fetch_recommendations {
    my ($imdb) = @_;
    my $url = 'http://www.imdb.com/title/tt' . $imdb->id . '/recommendations';
    my $content = get($url) || '';
    my ($extract) = $content =~ /by the database(.*?)if you want to see if a movie /s;
    $extract = '' if ! defined $extract;
    my %rec;
    while ($extract =~ m|href="/title/tt(\d+)/">([^<]+)|g) {
        my ($id, $title) = ($1, $2);
        $rec{$id} = $title;
    }
    return \%rec;
}
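The extraction step can be tried offline, without touching the site, which is roughly how I worked from saved files. The sketch below applies the same two regexes to a made-up HTML fragment. The markup and film titles are invented stand-ins, and extract_recommendations is just a name I picked for the example, so the real pages may well look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up fragment standing in for a saved page; the real markup may differ.
my $content = <<'END_HTML';
<p>Suggested by the database</p>
<a href="/title/tt0000001/">First Example Film</a>
<a href="/title/tt0000002/">Second Example Film</a>
<p>Check if you want to see if a movie is for you.</p>
<a href="/title/tt0000003/">Outside The Block</a>
END_HTML

# Hypothetical helper: same regexes as the script above, minus the fetch.
sub extract_recommendations {
    my ($html) = @_;

    # Keep only the text between the two marker phrases.
    my ($extract) = $html =~ /by the database(.*?)if you want to see if a movie /s;
    return {} if !defined $extract;

    # Collect id => title pairs from the title links in that block.
    my %rec;
    while ($extract =~ m|href="/title/tt(\d+)/">([^<]+)|g) {
        $rec{$1} = $2;
    }
    return \%rec;
}

my $recs = extract_recommendations($content);
for my $id (sort keys %$recs) {
    print "$id: $recs->{$id}\n";
}
```

Only the two links between the marker phrases are picked up; the third link falls outside the extracted block and is ignored. That dependence on surrounding page text is exactly why this approach breaks as soon as the wording changes.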
Cheers - L~R
In reply to Re: help a Dutchman with hash
by Limbic~Region
in thread help a Dutchman with hash
by nrbrtkls