Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that parses www.imagefap.com. Well, ALMOST parses, anyway! The site puts up a page asking my bot to confirm that I'm 18 or over before it will show some of the pictures. All it asks is that I follow a link that looks like this:
<a href="image.php?id=1117159255&check=6f4e9753bb326bc0a11f76f188160af3">I am 18 years of age or older</a>
Requesting that link directly with LWP::UserAgent doesn't appear to work; perhaps the confirmation isn't being stored in my cookie jar this way.

My question is this: because I load the pages through proxies, I'm using LWP::UserAgent. However, to follow the link that says my bot is of age, I need to use WWW::Mechanize.

I'm not sure how to get the two to work together. Can anyone see, from my code below, how I can manage to get it to follow the link with WWW::Mechanize while keeping the cookies and all that I need inside UserAgent? Or can this ALL be done using WWW::Mechanize without directly calling UserAgent?

my $ua = LWP::UserAgent->new;
$ua->proxy('http', "http://$ad:$pt"); # assume this is a working IP with a port number
$ua->timeout(5);

my ($one, $two) = split(/::/, $shuffled[$urlcnt]); # one is the URL, two is a custom referer url
$ua->default_header('Referer' => "$two"); # added my own referer
my $response = $ua->get($one);

# my attempt at scraping the page and manually GETting the verification link
$response->content =~ m/image\.php\?id=(\d+)\&check=([a-zA-Z0-9]+)/;
print "\t\t- http://imagefap.com/image.php?id=$1&check=$2 link\n";
$ua->get("http://imagefap.com/image.php?id=$1&check=$2");
$ua->mirror("http://imagefap.com/image.php?id=$1&check=$2", "fap.html") or die "Error: $!";

I have been trying to get this ImageFap bot to work for quite some time now and it's fun trying new tricks to get it to work.. but for now, I am stumped.

Replies are listed 'Best First'.
Re: mixing www::mech and lwp::UA
by davido (Cardinal) on Nov 27, 2006 at 04:53 UTC

    In the documentation for WWW::Mechanize you'll read, "WWW::Mechanize is a proper subclass of LWP::UserAgent and you can also use any of LWP::UserAgent's methods."

    That means you don't need to explicitly 'use' and create an object instance of both. WWW::Mechanize @ISA LWP::UserAgent.
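
A minimal sketch of what that subclass relationship means in practice, assuming WWW::Mechanize is installed. It only inspects the classes and makes no network requests; the variable names are illustrative.

```perl
use strict;
use warnings;
use WWW::Mechanize;    # loads LWP::UserAgent as its parent class

my $mech = WWW::Mechanize->new( autocheck => 0 );

# Every WWW::Mechanize object is also an LWP::UserAgent ...
my $is_ua = $mech->isa('LWP::UserAgent') ? 1 : 0;

# ... so inherited LWP::UserAgent methods such as proxy() are available.
my $has_proxy = $mech->can('proxy') ? 1 : 0;

# The reverse does not hold: a plain LWP::UserAgent has no agent_alias().
my $ua_has_alias = LWP::UserAgent->can('agent_alias') ? 1 : 0;

print "mech isa LWP::UserAgent:       $is_ua\n";
print "mech can proxy():              $has_proxy\n";
print "LWP::UserAgent can agent_alias: $ua_has_alias\n";
```

Inheritance runs one way only: create the WWW::Mechanize object and call LWP::UserAgent methods on it, not the other way around.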

    I wonder if the sites you're trying to scrape are intentionally rejecting your queries based on your user agent name? That's easy to test, by using $mech->agent_alias() to set your user agent name to something like Internet Explorer. Otherwise, the webserver sees you as something like LWP User Agent (if I recall). You could also check with the website's operator to see if they've got anything unusual going on there that might break your cookies. ...you are scraping within the TOS of the site, right? ;)
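
As a rough illustration of that test, assuming WWW::Mechanize is installed: agent_alias() expects one of the short alias names from the WWW::Mechanize documentation (such as 'Windows IE 6'), not a full User-Agent string. To send an arbitrary string, the inherited agent() method can be used instead. No request is made here.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );

# Pick a browser identity by its short alias name.
$mech->agent_alias('Windows IE 6');
my $aliased = $mech->agent;
print "via agent_alias(): $aliased\n";

# Or set the full User-Agent string directly with LWP::UserAgent's agent().
$mech->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
my $explicit = $mech->agent;
print "via agent():       $explicit\n";
```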


    Dave

      Hi.

      I'm not exactly sure what that means, that I don't need to explicitly use and create an object instance of both.

      From that, I imagined you meant that I could just call LWP::UserAgent and perform WWW::Mech functions without calling WWW::Mech. So from that, I tried

      $ua->agent_alias("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
      And it errors out saying there's no object method agent_alias.

      Do I have it backwards? If I call WWW::Mech, can I then use any LWP::UA methods I want?

Re: mixing www::mech and lwp::UA
by jdporter (Paladin) on Nov 27, 2006 at 03:51 UTC

    You should probably be using WWW::Mechanize for the entire job. An object of that class is also an LWP::UserAgent, so if you ever feel you need a $ua for something, you should be able to use the $wwwmech for it.

    Other than that, I don't know what to tell you. I'd probably be investigating why your cookies don't seem to get stored. It's supposed to Just Work, so....
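
One place to start that investigation, sketched under the assumption that LWP and HTTP::Cookies are installed: a plain LWP::UserAgent has no cookie jar until one is given to it, whereas WWW::Mechanize->new creates an in-memory jar by default. The check below makes no network requests.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# By default there is no jar, so cookies sent by the server are not kept.
my $before = defined $ua->cookie_jar ? 'present' : 'missing';

# An empty hash ref asks LWP to create an in-memory HTTP::Cookies jar.
$ua->cookie_jar( {} );
my $after = ref $ua->cookie_jar;

print "cookie jar before: $before\n";
print "cookie jar after:  $after\n";

# After a real request, the stored cookies can be dumped with:
#   print $ua->cookie_jar->as_string;
```

If the dump is empty after visiting the confirmation link, the site is probably not setting a cookie on that request, and the problem lies elsewhere.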

    We're building the house of the future together.
Re: mixing www::mech and lwp::UA
by clscott (Friar) on Nov 27, 2006 at 19:47 UTC
    Mechanize is a subclass of LWP::UserAgent, so set up a cookie jar:
    my $mech = WWW::Mechanize->new();
    $mech->proxy(...);
    $mech->cookie_jar({});
    $mech->get( ... );
    --
    Clayton
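
Putting the replies together, the whole job might look something like the sketch below. It is untested against the real site and rests on several assumptions: the helper names find_age_link and fetch_gated_page are made up for illustration, the URLs and proxy in the usage comment are placeholders, and the link text is taken from the snippet in the question.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Return the age-confirmation link on the page $mech currently holds,
# or undef if the page has no such link.
sub find_age_link {
    my ($mech) = @_;
    return $mech->find_link( text_regex => qr/18 years of age/i );
}

# Fetch $url through $proxy, clicking through the age check if it appears.
sub fetch_gated_page {
    my ( $url, $proxy, $referer ) = @_;

    # autocheck => 0: report failures via the response instead of dying.
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->proxy( 'http', $proxy );    # inherited from LWP::UserAgent
    $mech->timeout(5);

    # Send the custom referer with the first request only.
    my $response = $mech->get( $url, Referer => $referer );
    return $response unless $response->is_success;

    if ( my $link = find_age_link($mech) ) {
        # Same object, same cookie jar, so any cookie set here is kept.
        $response = $mech->get( $link->url_abs );
    }
    return $response;
}

# Usage (placeholder values):
# my $res = fetch_gated_page( 'http://www.example.com/image.php?id=1',
#                             'http://127.0.0.1:8080',
#                             'http://www.example.com/' );
# print $res->is_success ? "ok\n" : $res->status_line . "\n";
```

Because one object makes every request, the proxy settings and the cookies stay together, which is the point the replies above are making.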