techtoys has asked for the wisdom of the Perl Monks concerning the following question:

Forgive me for my intrusion on the wise PerlMonks; I am a newbie in need of guidance.

What I'm attempting to do is log in to amazon.com with my account and then download a copy of one of their product pages. I need to be logged in in order to see all of the prices listed.

This worked last week, but it stopped working today even though I made no changes to the script.

I was not able to log in directly at amazon.com, as their login form would not submit no matter what I tried. So I used their affiliate login page instead; that login succeeds, and I can confirm it from the page returned. But I no longer stay logged in when I then try to download the various product pages.

Thanks so much in advance for any help that can be provided.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $url      = "https://affiliate-program.amazon.com/";
my $appurl   = "https://www.amazon.com/gp/bestsellers/videogames";
my $username = 'xxxxx';
my $password = 'xxxxx';

my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->cookie_jar(HTTP::Cookies->new());

# Log in through the affiliate sign-in form.
$mech->get($url);
$mech->form_name('sign_in');
$mech->field(email    => $username);
$mech->field(password => $password);
$mech->submit();
my $login_content = $mech->content();

# Go to an amazon url.
$mech->get($appurl);
my $app_content = $mech->content();

# Three-arg open with an error check; the original filename had a stray trailing space.
open(my $ifp, '>', 'amazontest.html') or die "Cannot write amazontest.html: $!";
print {$ifp} $app_content;
close $ifp;
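One thing that may help in debugging the session problem is to give Mechanize a file-backed cookie jar, so you can check after a run whether the login cookies are actually being set and kept. A minimal sketch (the filename is arbitrary):

    use WWW::Mechanize;
    use HTTP::Cookies;

    # A file-backed jar: cookies are saved automatically and can be
    # inspected after a run. ignore_discard keeps session cookies too,
    # which would otherwise be dropped when the jar is written out.
    my $mech = WWW::Mechanize->new(
        autocheck  => 1,
        cookie_jar => HTTP::Cookies->new(
            file           => 'amazon_cookies.txt',
            autosave       => 1,
            ignore_discard => 1,
        ),
    );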

Replies are listed 'Best First'.
Re: Using Mechanize to get website content
by Your Mother (Archbishop) on Jun 16, 2009 at 05:29 UTC

    I don't have direct help for you with your question. I will say that what you're doing is against Amazon's terms of service, and there's no reason to bother: it's difficult and dicey, and they often swap page layouts live as part of A/B tests or while phasing changes in across many servers. They have a free API which gets the information robustly and, though it won't feel like it at first, quite directly and simply. I don't think there are any good Perl tutorials on it, though, which is sad.

    URI::Amazon::APA is an excellent, minimal interface to the REST version of the APA (Amazon Product Advertising) portion of AWS (Amazon Web Services). You need a dev account and to do some document diving, but it has big advantages over what you're trying to do: 1) it doesn't break your user agreement; 2) it's robust and won't fail or change (well, not more than once every five years or so, anyway).
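    For reference, a rough sketch of a signed request with URI::Amazon::APA, adapted from the module's synopsis (the key, secret, and search terms are placeholders, not anything real):

        use strict;
        use warnings;
        use URI::Amazon::APA;
        use LWP::UserAgent;

        # Build the REST query, then let the module sign it.
        my $u = URI::Amazon::APA->new('http://webservices.amazon.com/onca/xml');
        $u->query_form(
            Service     => 'AWSECommerceService',
            Operation   => 'ItemSearch',
            Title       => 'Perl',
            SearchIndex => 'Books',
        );
        $u->sign(
            key    => 'YOUR_ACCESS_KEY_ID',    # placeholders from your dev account
            secret => 'YOUR_SECRET_KEY',
        );

        my $res = LWP::UserAgent->new->get($u);
        print $res->is_success ? $res->content : $res->status_line;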

    I don't know if there are modules with more features for this on CPAN right now. I do know that the legacy Amazon interface, the ECS (E-Commerce Service), which is what most of the older modules are built on (including, again sadly, the one I wrote for myself and a client and will now have to replace), is going to stop working on August 15th. Amazon gave a loooooong lead time on this, over a year if memory serves, and the originally announced disconnect date is already well past, again IIRC. So this deadline seems unlikely to be pushed back.

Re: Using Mechanize to get website content
by Marshall (Canon) on Jun 16, 2009 at 05:49 UTC
    I have no idea what Amazon's policy is regarding this; I suspect it depends on what you are going to do with the info! I have quite a number of gizmos that log on and "scrape info" from sites, but I don't use or distribute this info in any way that violates user agreements. Basically, if I just automate something that I could do myself manually, and use the info the way I would if I had collected it manually by cut-and-paste (subject to the user account agreement), that is normally OK.

    Update: legal things aside, I have "throttles" on my web-automation stuff. I don't want to be impolite by generating too many requests per unit of time; my programs can be patient. How much load you are generating on the host side should be a consideration for your web-automation work too, or at least it is for mine.
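    A minimal sketch of that kind of throttle, assuming a WWW::Mechanize object as in the original post (the helper name and interval are just for illustration):

        use strict;
        use warnings;
        use Time::HiRes qw(sleep time);

        my $min_interval = 5;    # seconds between requests; pick something polite
        my $last_request = 0;

        # Wrap every fetch so consecutive requests are at least
        # $min_interval seconds apart.
        sub throttled_get {
            my ($mech, $url) = @_;
            my $wait = $min_interval - (time() - $last_request);
            sleep($wait) if $wait > 0;
            $last_request = time();
            return $mech->get($url);
        }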