in reply to Re^3: Ethical issues with screen scraping
in thread Use WWW::Mechanize to Download Pictures of Sayuri Anzu

Thanks for your input. I've heard of robots.txt files, but I haven't had to deal with one until now.

Unlike that monster on the White House web site, this one is rather small:

User-agent: *
Disallow: /cgi-bin/
Disallow: /journals/EJDE/Monographs/
Disallow: /journals/EJDE/Volumes/

From what I read yesterday about robots.txt files, I'm OK, since I'm scraping the results of a search page that resides in a different directory.
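For what it's worth, that reading of the rules can be checked mechanically. Here is a rough sketch in Python, used only because its standard library ships a robots.txt parser. The host `example.org`, the agent name `MyScraper`, and the `/search/results` path are made-up placeholders, since the real ones aren't given here; only the rules themselves come from the file quoted above.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /journals/EJDE/Monographs/
Disallow: /journals/EJDE/Volumes/
"""

# Hypothetical placeholders: the real host, agent name and search path
# are not stated in this thread.
BASE = "http://example.org"
AGENT = "MyScraper"


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def is_allowed(rp, path):
    """Return True if AGENT may fetch the given path under the parsed rules."""
    return rp.can_fetch(AGENT, BASE + path)


if __name__ == "__main__":
    rp = build_parser(ROBOTS_TXT)
    for path in ("/search/results", "/cgi-bin/query",
                 "/journals/EJDE/Volumes/2004/"):
        print(path, "allowed" if is_allowed(rp, path) else "disallowed")
```

The matching is a simple prefix test, so anything outside the three listed directories comes back as allowed for every user agent.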

But your advice about asking the webmaster for an appropriate delay is well taken; I'll see if I can contact him. I'm sure this is quite a capable server, since it's a service of the European Mathematical Society. Plus, there are several mirrors.

In general, though, are you saying that even when I'm accessing high-bandwidth servers, I should use a delay of at least two seconds between requests?
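Whatever the right number turns out to be, holding to a minimum gap between requests is easy enough. A minimal sketch, in Python for illustration: the `Throttle` class and its parameters are hypothetical names of my own, and the two-second figure is just the number from this thread, not any standard.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable, to allow testing
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(2.0)  # assumed delay; ask the webmaster for a real one
    for page in range(3):
        throttle.wait()       # the first call returns immediately
        print("fetching page", page)  # the actual request would go here
```

Measuring from the previous request, rather than always sleeping a fixed two seconds, means time spent parsing a page counts toward the delay.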

TheEnigma

Re^5: Ethical issues with screen scraping
by Ovid (Cardinal) on Aug 19, 2004 at 20:02 UTC

    Actually, I just tossed "2 seconds" out there. I've never given much thought to how long such a delay should really be, and I don't know that there really is an optimum number. High-bandwidth servers should easily handle a delay of less than two seconds. If there is any standard on this, I'd certainly like to know about it.

    Cheers,
    Ovid
