in reply to Re: Use WWW::Mechanize to Download Pictures of Sayuri Anzu
in thread Use WWW::Mechanize to Download Pictures of Sayuri Anzu

I just wrote a script for someone using LWP to search a web site and extract some data. He wants to take an existing file of bibliographic data and get an additional piece of data on each article from this web site. His example file had only about 100 articles I needed to search for; I don't know how many his real file will have.
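
The core of it is just an LWP loop over the file, something like this sketch (the file name, search URL, and extraction pattern here are hypothetical stand-ins, not the real site's details):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $ua = LWP::UserAgent->new( agent => 'BiblioFetcher/0.1' );

    # 'articles.txt' and the search URL are hypothetical stand-ins
    open my $fh, '<', 'articles.txt' or die "Can't open articles.txt: $!";
    while ( my $title = <$fh> ) {
        chomp $title;
        my $response = $ua->get( 'http://example.org/search?q=' . uri_escape($title) );
        next unless $response->is_success;

        # pull the one extra field we need off the results page
        # (the pattern is made up for illustration)
        if ( $response->decoded_content =~ /Accession:\s*(\S+)/ ) {
            print "$title\t$1\n";
        }
    }
    close $fh;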

This seems very similar to Anonymous Monk's script insofar as it's repeatedly accessing a site. Does etiquette dictate that my script sleep as well, or are these different animals?

And if I should, isn't 2 seconds a little long? I would think the server could process a lot of requests in that time.

TheEnigma


Re^3: Ethical issues with screen scraping
by Ovid (Cardinal) on Aug 18, 2004 at 19:06 UTC

    There are a few issues involved here. The first, of course, is determining the Terms of Service or "Fair Use" policy of the site in question. Do they disallow screen scraping? Do they have a robots.txt file that disallows your program from accessing the files in question? If so, respecting that is important etiquette. For example, you could check out the robots.txt file in the root directory of the White House Web site.
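
    Incidentally, you don't have to eyeball robots.txt by hand: WWW::RobotRules (bundled with LWP) will parse it and tell you whether a given URL is allowed. A minimal sketch, with placeholder URLs:

        use WWW::RobotRules;
        use LWP::Simple qw(get);

        my $rules = WWW::RobotRules->new('MyScraper/0.1');

        # fetch and parse the site's robots.txt (placeholder host)
        my $robots_url = 'http://example.org/robots.txt';
        my $robots_txt = get($robots_url);
        $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

        # ask before each request
        my $page = 'http://example.org/search?q=foo';
        print "Fetching $page\n" if $rules->allowed($page);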

    Assuming there are no ethical objections to writing your program, it might be a good idea to contact the Webmaster of the site you are scraping and ask them what an appropriate delay is. As tilly pointed out, if someone is serving CGIs off an old computer at home, even your two second delay could be problematic.
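
    If you'd rather not manage the robots.txt check and the delay yourself, LWP::RobotUA is a drop-in LWP::UserAgent subclass that does both: it fetches and obeys robots.txt and spaces out requests to the same host. One gotcha: its delay() is specified in minutes, not seconds. A sketch with placeholder values:

        use LWP::RobotUA;

        my $ua = LWP::RobotUA->new('MyScraper/0.1', 'me@example.org');
        $ua->delay(2/60);    # delay() takes minutes, so this is two seconds

        # requests now honor robots.txt and are automatically throttled
        my $response = $ua->get('http://example.org/search?q=foo');
        print $response->status_line, "\n";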

    Cheers,
    Ovid

    New address of my CGI Course.

      Thanks for your input. I've heard of robots.txt files, but I'd never dealt with one before.

      Unlike that monster on the White House web site, this one is rather small:

        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /journals/EJDE/Monographs/
        Disallow: /journals/EJDE/Volumes/

      From what I read yesterday about robots.txt files, I'm OK, since I'm scraping the results of a search page that resides in a different directory.

      But your advice about asking the webmaster for an appropriate delay is well taken; I'll see if I can contact him. I'm sure the server is quite capable, since it's a service of the European Mathematical Society. Plus, there are several mirrors.

      In general, though, are you saying that even when I'm accessing high-bandwidth servers, I should use a delay of at least two seconds?

      TheEnigma

        Actually, I just tossed "2 seconds" out there. I've never given much thought to how long such a delay should really be, and I don't know that there is an optimum number. A high-bandwidth server should easily handle requests spaced less than two seconds apart. If there is any standard on this, I'd certainly like to know.
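
        For what it's worth, if you do settle on a sub-second pause, note that Perl's built-in sleep() only takes whole seconds; Time::HiRes (in the core distribution) exports a sleep() that accepts fractions. A sketch:

            use Time::HiRes qw(sleep);

            for my $url (@urls) {
                # ... fetch and process $url here ...
                sleep 0.5;    # half a second between requests
            }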

        Cheers,
        Ovid

        New address of my CGI Course.