in reply to Use WWW::Mechanize to Download Pictures of Sayuri Anzu

I'll not comment on this other than to say it's considered polite to put a delay (such as a sleep 2 or something) between downloads so as not to launch an accidental DoS attack on their server.
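In case it isn't obvious where that sleep goes, here is a minimal sketch; the gallery URL and the .jpg pattern are placeholders, not taken from the original script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use File::Basename qw(basename);

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://www.example.com/gallery.html');    # placeholder URL

    # Save every link that looks like a JPEG, pausing between downloads.
    for my $link ( $mech->find_all_links( url_regex => qr/\.jpe?g$/i ) ) {
        my $url  = $link->url_abs;
        my $file = basename( $url->path );
        $mech->get( $url, ':content_file' => $file );
        sleep 2;    # be polite to their server
    }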

Cheers,
Ovid

New address of my CGI Course.


Re^2: Use WWW::Mechanize to Download Pictures of Sayuri Anzu
by TheEnigma (Pilgrim) on Aug 18, 2004 at 03:23 UTC
    I just wrote a script for someone using LWP to do a search of a web site and extract some data. He wants to take an existing file of bibliographic data and, for each article in it, pull one additional piece of data from this web site. His example file had only about 100 articles I needed to search for; I don't know how many his real file will have.
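    Roughly, the loop looks like this; the site URL, query parameter, pattern, and titles are all invented here for illustration:

        use strict;
        use warnings;
        use LWP::UserAgent;
        use URI;

        my $ua = LWP::UserAgent->new;
        $ua->agent('BiblioLookup/0.1');

        # Stand-ins for the titles that would come from his bibliographic file.
        my @titles = ( 'First article title', 'Second article title' );

        for my $title (@titles) {
            my $uri = URI->new('http://www.example.org/search');    # placeholder URL
            $uri->query_form( title => $title );

            my $resp = $ua->get($uri);
            next unless $resp->is_success;

            # Pull the one extra piece of data out of the results page
            # (the pattern is made up for this example).
            if ( $resp->content =~ /Volume\s+(\d+)/ ) {
                print "$title => volume $1\n";
            }
            # sleep 2;    # <-- the question: does etiquette require this?
        }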

    This seems very similar to Anonymous Monk's script insofar as it's repeatedly accessing a site. Does etiquette dictate that my script sleep as well, or are these different animals?

    And if I should, isn't 2 seconds a little long? I would think the server could process a lot of requests in that time.

    TheEnigma

      There are a few issues involved here. The first, of course, is determining the Terms of Service or "Fair Use" of the site in question. Do they disallow screen scraping? Do they have a robots.txt file that disallows your program from accessing the files in question? If so, respecting that is important etiquette. For example, you could check out the robots.txt file in the root directory of the White House Web site.
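      If you'd rather check programmatically than by eye, WWW::RobotRules (it comes with LWP) will parse a robots.txt for you. A quick sketch, with placeholder URLs and agent name:

          use strict;
          use warnings;
          use LWP::Simple qw(get);
          use WWW::RobotRules;

          my $robots_url = 'http://www.example.org/robots.txt';    # placeholder
          my $rules      = WWW::RobotRules->new('MyScraper/0.1');  # your agent name

          my $robots_txt = get($robots_url);
          $rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

          # True unless robots.txt disallows this URL for our agent.
          if ( $rules->allowed('http://www.example.org/some/page.html') ) {
              print "Fetching is allowed\n";
          }
          else {
              print "Disallowed by robots.txt\n";
          }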

      Assuming there are no ethical objections to writing your program, it might be a good idea to contact the Webmaster of the site you are scraping and ask them what an appropriate delay is. As tilly pointed out, if someone is serving CGIs off an old computer at home, even your two-second delay could be problematic.
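      If you want the delay (and robots.txt) handled for you, LWP::RobotUA, also part of LWP, is a drop-in replacement for LWP::UserAgent that does both. A rough sketch, with a made-up agent name, address, and URL:

          use strict;
          use warnings;
          use LWP::RobotUA;

          # The agent name and 'from' address identify you to the site's admin.
          my $ua = LWP::RobotUA->new( 'BiblioLookup/0.1', 'you@example.com' );
          $ua->delay( 2 / 60 );    # delay between requests, in *minutes* (about 2 seconds)

          # It fetches the site's robots.txt itself and refuses disallowed URLs.
          my $resp = $ua->get('http://www.example.org/search');    # placeholder URL
          print $resp->status_line, "\n";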

      Cheers,
      Ovid

      New address of my CGI Course.

        Thanks for your input. I've heard of robots.txt files, but I've never dealt with one before.

        Unlike that monster on the White House web site, this one is rather small:

          User-agent: *
          Disallow: /cgi-bin/
          Disallow: /journals/EJDE/Monographs/
          Disallow: /journals/EJDE/Volumes/

        From what I read yesterday about robots.txt files, I'm OK, since I'm scraping the results of a search page that resides in a different directory.

        But your advice about asking the webmaster for an appropriate delay is well taken; I'll see if I can contact him. I'm sure this is a quite capable server, since it's a service of the European Mathematical Society. Plus there are several mirrors.

        In general, though, are you saying that even if I'm accessing high-bandwidth servers, I should be using at least a two-second delay?

        TheEnigma
