In Japan, people like to set up image boards, which are like simple forums where pretty much anyone can upload a picture. People can also add comments to the pictures. It's a simple and fun system.

People use these image boards to post pictures of all kinds of things, but mostly they're used to post hotties. (No surprise there, I'm sure.) In this Cool Use For Perl(TM), WWW::Mechanize is used to crawl an image board that specializes in pictures of Sayuri Anzu, a popular model.

On its first run, this script will find and download all the images posted on the board. It can also be run again at a later date to get any new pictures that have been uploaded since the last run. It's a lot easier than doing right-click, save about 200 times, that's for sure.

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use File::Basename;

# load the image board
my $mech = WWW::Mechanize->new();
$mech->get("http://uomimi.s3.x-beat.com/imgboard/imgboard.cgi");

# the first page has fewer forms than the rest of the pages.
my $which_form = 4;

# let's see how deep this goes.
do {
  # get all the image links
  my @anzu = $mech->find_all_links(url_regex => qr/img-box.*\.jpg$/);

  foreach (@anzu) {
    my $filename = basename($_->url);

    unless (-e $filename) {
      # download (if we don't have it already)
      print "$filename\n";
      $mech->get($_->url_abs, ':content_file' => $filename);
      $mech->back();
    }
    else {
      # quit (if we've already got this)
      exit 0;
    }
  }

  # go to the next page
  if ($mech->form_number($which_form)) {
    $mech->submit;
    $which_form = 6;
  }
  else {
    $which_form = 0;
  }

  # repeat until we can't go any further
} until ($which_form == 0);

# vim:sts=2 sw=2 expandtab

Re: Use WWW::Mechanize to Download Pictures of Sayuri Anzu
by Ovid (Cardinal) on Aug 04, 2004 at 21:05 UTC

    I'll not comment on this other than to say it's considered polite to put a delay (such as a sleep 2 or something) between downloads so as not to launch an accidental DoS attack on their server.
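
    For instance (a minimal sketch, not the original poster's code), the pause could go right after each image fetch in the download loop above; the two-second figure is just the suggestion from this reply, not a measured value:

      unless (-e $filename) {
        # download (if we don't have it already)
        print "$filename\n";
        $mech->get($_->url_abs, ':content_file' => $filename);
        $mech->back();
        sleep 2;    # be polite: pause between downloads
      }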

    Cheers,
    Ovid

    New address of my CGI Course.

      I just wrote a script for someone using LWP to do a search of a web site and extract some data. He wants to take an existing file of bibliographic data, and get an additional piece of data on each article from this web site. His example file had only about 100 articles I needed to search for. I don't know how many his real file will have.

      This seems very similar to Anonymous Monk's script insofar as it's repeatedly accessing a site. Does etiquette dictate my script sleep also, or are these different animals?

      And if I should, isn't 2 seconds a little long? I would think the server could process a lot of requests in that time.

      TheEnigma

        There are a few issues involved here. The first, of course, is determining the Terms of Service or "Fair Use" of the site in question. Do they disallow screen scraping? Do they have a robots.txt file that disallows your program accessing the files in question? If so, respecting that is important etiquette. For example, you could check out the robots.txt file in the root directory of the White House Web site.
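
        As a rough sketch of that robots.txt check (the agent name and URLs here are placeholders, not any particular site), WWW::RobotRules from libwww-perl can parse the file and tell you whether a given page is off limits:

          #!/usr/bin/perl -w
          use strict;
          use LWP::Simple qw(get);
          use WWW::RobotRules;

          # Hypothetical agent name and site -- substitute your own.
          my $rules      = WWW::RobotRules->new('MyScraper/0.1');
          my $robots_url = 'http://www.example.com/robots.txt';

          # Fetch and parse robots.txt, if the site has one.
          my $robots_txt = get($robots_url);
          $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

          # Check a page before scraping it.
          my $page = 'http://www.example.com/some/page.html';
          print $rules->allowed($page)
            ? "allowed: $page\n"
            : "disallowed by robots.txt: $page\n";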

        Assuming there are no ethical objections to writing your program, it might be a good idea to contact the Webmaster of the site you are scraping and ask them what an appropriate delay is. As tilly pointed out, if someone is serving CGIs off an old computer at home, even your two second delay could be problematic.

        Cheers,
        Ovid

        New address of my CGI Course.

Re: Use WWW::Mechanize to Download Pictures of Sayuri Anzu
by Anonymous Monk on Aug 05, 2004 at 04:54 UTC
    You bring up a good point. I've noticed that their servers seem to be under a high load when it's morning on the West Coast (of the USA), but during the evenings downloads seem to go much faster. That knowledge, plus a well-placed sleep(2) statement, should keep anyone from harming the server.

      I recommend erring on the side of over-politeness because it's more likely to keep resources open and less likely to inspire lots of new tools to block this kind of thing. I usually sleep 30 seconds or more per domain (requests to different sites don't really need a sleep between them). The big players like Google wait even longer between requests with their bots.
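
      A rough sketch of that per-domain pacing (the 30-second delay and the polite_get wrapper are just illustrative, not anyone's published code): keep a timestamp per host and only sleep when you're about to hit the same domain again.

        #!/usr/bin/perl -w
        use strict;
        use URI;
        use WWW::Mechanize;

        my %last_hit;       # host name => epoch time of the last request there
        my $delay = 30;     # seconds to wait between hits to the same domain

        # Wrapper around get() that sleeps only when we've hit this host recently.
        sub polite_get {
          my ($mech, $url) = @_;
          my $host = URI->new($url)->host;
          if (defined $last_hit{$host}) {
            my $wait = $delay - (time() - $last_hit{$host});
            sleep $wait if $wait > 0;
          }
          $last_hit{$host} = time();
          return $mech->get($url);
        }

        my $mech = WWW::Mechanize->new();
        polite_get($mech, 'http://www.example.com/page1.html');
        polite_get($mech, 'http://www.example.com/page2.html');  # waits ~30s
        polite_get($mech, 'http://www.example.org/');             # no wait: new domain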