in reply to HTTP::Lite GET - too many requests?

Contact the site owner and ask for a more direct way to get at the data. That would probably save both of you some bandwidth, and it would let you avoid most of the parsing.

If the site owner doesn't want to give you access to the data, there's a good chance you shouldn't be scraping the site at such a large scale in the first place.


Re^2: HTTP::Lite GET - too many requests?
by mhnatiuk (Novice) on Jul 14, 2008 at 01:48 UTC
    OK, I'll have to explain this a little more. I'm doing a research project for my MA in sociology. I contacted the site owner almost half a year ago, and he promised to give me access to their database. That would make my life a lot easier. The problem is that the site owner doesn't really know much about programming (he's a journalist), so the site is managed by some outsourcing guys, who *magically* haven't had time for the past 6 months to do this. So, having permission to access their data, I decided to write a crawler to get it. Do you know if it's possible to use a proxy or SOCKS to get around the limit of connections per IP, which is most probably set on the server?

      Simply don't hammer the site. Slow your requests down by sleeping between them: you should sleep at least as long as it took for the last request to be processed. All other "circumvention ideas" will only lead to an arms race between you and the hosting people.
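      A minimal sketch of that pacing idea, assuming HTTP::Lite as in the original question; the URL list and the processing step are placeholders you'd replace with your own:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use HTTP::Lite;
          use Time::HiRes qw(time sleep);   # fractional-second timing and sleeping

          # Placeholder list - substitute the pages you actually have permission to fetch.
          my @urls = map { "http://example.com/page/$_" } 1 .. 10;

          for my $url (@urls) {
              my $http  = HTTP::Lite->new;
              my $start = time;
              my $code  = $http->request($url);   # HTTP status code, or undef on failure
              my $took  = time - $start;

              if (defined $code && $code == 200) {
                  my $body = $http->body();
                  # ... parse/store $body here ...
              }
              else {
                  warn "Request for $url failed (status: ",
                       defined $code ? $code : 'no response', ")\n";
              }

              # Sleep at least as long as the last request took, so the crawler
              # never asks for more than half of the server's attention.
              sleep($took);
          }

      Sleeping for as long as the previous request took effectively caps the crawler at half of whatever the server can sustain at that moment, which is about as polite as an unattended crawl gets.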

      Note that the hosting people have no interest in your task. They likely only care about keeping the website up and keeping bots from crawling it.

      Test your crawler on a local copy of some pages.

      It's possible, but it's unethical, and it could even carry jail time.