in reply to Re^2: Use WWW::Mechanize to Download Pictures of Sayuri Anzu
in thread Use WWW::Mechanize to Download Pictures of Sayuri Anzu

There are a few issues involved here. The first, of course, is determining the Terms of Service or "Fair Use" of the site in question. Do they disallow screen scraping? Do they have a robots.txt file that disallows your program from accessing the files in question? If so, respecting that is important etiquette. For example, you could check out the robots.txt file in the root directory of the White House Web site.

Assuming there are no ethical objections to writing your program, it might be a good idea to contact the Webmaster of the site you are scraping and ask them what an appropriate delay is. As tilly pointed out, if someone is serving CGIs off an old computer at home, even your two second delay could be problematic.
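As a rough sketch of what "building in a delay" might look like in practice (the URL list and the two-second figure below are placeholders for illustration, not advice for any particular site), one way is to wrap the fetch in a small loop that sleeps between requests:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a polite fetch loop: pause between requests so the script
# never hammers the server. The 2-second figure is only an example.
my $delay = 2;    # seconds between requests

# Fetch each URL via the supplied code ref, sleeping between requests.
sub fetch_politely {
    my ( $fetch, @urls ) = @_;
    my @pages;
    for my $i ( 0 .. $#urls ) {
        sleep $delay if $i > 0;    # no pause before the first request
        push @pages, $fetch->( $urls[$i] );
    }
    return @pages;
}

# With WWW::Mechanize the code ref might look like this
# (hypothetical URLs, untested sketch):
#   my $mech  = WWW::Mechanize->new( autocheck => 1 );
#   my @pages = fetch_politely(
#       sub { $mech->get( $_[0] ); return $mech->content },
#       'http://example.com/page1',
#       'http://example.com/page2',
#   );
```

Passing the fetch as a code ref keeps the throttling logic separate from WWW::Mechanize itself, so the same loop works whatever you use to do the actual download.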

Cheers,
Ovid

New address of my CGI Course.


Re^4: Ethical issues with screen scraping
by TheEnigma (Pilgrim) on Aug 19, 2004 at 18:39 UTC
    Thanks for your input. I've heard of robots.txt files, but I've never dealt with one before.

    Unlike that monster on the White House web site, this one is rather small:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /journals/EJDE/Monographs/
    Disallow: /journals/EJDE/Volumes/

    From what I read yesterday about robots.txt files, I'm OK, since I'm scraping the results of a search page that resides in a different directory.
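    To make that reasoning concrete, here is a minimal sketch of checking a path against those Disallow lines. It is deliberately not a complete robots.txt parser (it ignores per-agent records, Allow lines, and so on — a module such as WWW::RobotRules handles the real format), and the search path shown is a made-up example, since the thread doesn't give the actual one:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the Disallow prefixes from a robots.txt body.
# Simplified sketch: assumes a single 'User-agent: *' record.
sub disallowed_prefixes {
    my ($robots_txt) = @_;
    my @prefixes;
    for my $line ( split /\n/, $robots_txt ) {
        push @prefixes, $1 if $line =~ /^Disallow:\s*(\S+)/i;
    }
    return @prefixes;
}

# A path is blocked if it starts with any disallowed prefix.
sub path_allowed {
    my ( $path, @prefixes ) = @_;
    for my $prefix (@prefixes) {
        return 0 if index( $path, $prefix ) == 0;
    }
    return 1;
}

# The robots.txt quoted above:
my $robots = <<'END';
User-agent: *
Disallow: /cgi-bin/
Disallow: /journals/EJDE/Monographs/
Disallow: /journals/EJDE/Volumes/
END

my @deny = disallowed_prefixes($robots);

# '/search/results.html' is a hypothetical stand-in for the search page.
for my $path ( '/journals/EJDE/Volumes/2004/', '/search/results.html' ) {
    printf "%s allowed: %s\n", $path,
        path_allowed( $path, @deny ) ? "yes" : "no";
}
# Prints:
#   /journals/EJDE/Volumes/2004/ allowed: no
#   /search/results.html allowed: yes
```

    So a search page outside the listed directories is indeed allowed, while anything under /journals/EJDE/Volumes/ is not.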

    Your advice about asking the webmaster for an appropriate delay is well taken; I'll see if I can contact him. I'm sure this is quite a capable server, since it's a service of the European Mathematical Society. Plus there are several mirrors.

    In general, though, are you saying that even when I'm accessing high-bandwidth servers, I should be using at least a two-second delay?

    TheEnigma

      Actually, I just tossed "2 seconds" out there. I've never given much thought to how long such a delay should really be, and I don't know that there is an optimum number. High-bandwidth servers should easily handle a delay of less than two seconds. If there is any standard on this, I'd certainly like to know.

      Cheers,
      Ovid

      New address of my CGI Course.