mhnatiuk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to write a script that will download data from a weblog service. There are over 4k blogs there. I have already written the parser, and I'm using HTTP::Lite to perform the GETs. However, when I start the script, it fails after roughly 200 requests. I analyzed the traffic with Wireshark: I'm seeing a lot of duplicate ACKs and retransmissions, and then my client sends FIN/ACK and closes the connection.
Update: it's important to add that I can't simply rerun the script; if I do, the server doesn't respond at all -> my client sends six TCP SYN segments and gets no response.
My guess is that the server (or a proxy) is blocking me for making too many requests. I perform the GETs in a loop, so there are no parallel requests. I suppose I need to add some "sleeping" code (that adapts to changing network performance and, most importantly, to the server's limits), but before I do I'd like to hear what you think. Also, there are over 4k blogs, each with around 20-30 posts, so with sleep(10 seconds) downloading the whole site would take something like 300-400 hours, and I need the results in about 3 days :)
sub get_blog {
    # ...
    # ... some code
    $ua->add_req_header("User-Agent",
        "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9) Gecko/2008061015 Firefox/3.0");
    $ua->request($_[0]) or die "unable to get " . $_[0];
    my $content = $ua->body();
    $ua->reset();

    # IF this is the first page of the blog -> parse the constant elements of the blog
    # parse posts and comments
    # recursion -> find the address of the next page & get it
}
There is no additional error handling around the GET.
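
For what it's worth, a minimal sketch of retry handling around the GET, assuming the HTTP::Lite calls used above; fetch_with_retries(), the retry count, and the back-off delays are made up here for illustration:

use strict;
use warnings;
use HTTP::Lite;

# Hypothetical helper: fetch one URL, retrying a few times with a growing pause.
sub fetch_with_retries {
    my ($url, $max_tries) = @_;
    $max_tries ||= 3;

    my $ua = HTTP::Lite->new;
    for my $try (1 .. $max_tries) {
        $ua->reset();
        $ua->add_req_header("User-Agent",
            "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9) Gecko/2008061015 Firefox/3.0");

        my $status = $ua->request($url);    # undef means the connection itself failed
        return $ua->body() if defined $status && $status == 200;

        warn sprintf "GET %s failed (try %d/%d): %s\n",
            $url, $try, $max_tries,
            defined $status ? "HTTP $status" : "no response";
        sleep 5 * $try;                     # back off a little longer each time
    }
    return;                                 # undef: let the caller decide what to do
}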
Thanks

Re: HTTP::Lite GET - too many requests?
by Crackers2 (Parson) on Jul 13, 2008 at 18:25 UTC

    Contact the site owner and ask for a more direct way to get at the data. That would probably save both of you some bandwidth, and it would let you avoid most of the parsing.

    If the site owner doesn't want to give you access to the data, there's a good chance you shouldn't be scraping the site at this scale in the first place.

      OK, I'll have to explain this a little more. I'm doing a research project for my MA in sociology. I contacted the site owner almost half a year ago, and he promised to give me access to their database, which would make my life a lot easier. The problem is that the site owner doesn't know much about programming (he's a journalist), so the site is managed by some outsourcing guys, who have *magically* not had time to do this for the past six months. So, having permission to access their data, I decided to write a crawler to get it. Do you know whether it's possible to use a proxy or SOCKS to get around the per-IP connection limit that is most probably set on the server?

        Simply don't hammer the site. Make your requests slower by sleeping between them; you should sleep at least as long as it took for the last request to be processed (a rough sketch follows below). Any other "circumvention idea" will only lead to an arms race between you and the hosting people.

        Note that the hosting people have no interest in your task. They most likely only care about keeping the website up and keeping bots from crawling it.
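
        A rough sketch of that pacing, reusing the poster's get_blog() from above; @blog_urls and the one-second floor are assumptions for illustration:

        use strict;
        use warnings;
        use Time::HiRes qw(time sleep);     # fractional timings and sleeps

        my @blog_urls = @ARGV;              # the list of blog URLs, however it is built

        for my $url (@blog_urls) {
            my $start   = time();
            get_blog($url);                 # one GET plus parsing, as in the original sub
            my $elapsed = time() - $start;

            # Pause at least as long as the request itself took (with a small
            # floor), so a slow or struggling server slows the crawler down too.
            my $pause = $elapsed > 1 ? $elapsed : 1;
            sleep($pause);
        }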

        Test your crawler on a local copy of some pages.

        It's possible, but it's unethical, and it could even land you in jail.
Re: HTTP::Lite GET - too many requests?
by Tanktalus (Canon) on Jul 14, 2008 at 03:35 UTC

    Do you have your own website, perchance? One obvious approach (at least given that lead-in) would be to duplicate the format, etc., and scrape against your own test site first, but I'm not going to suggest that; my guess is that it's far more work than any other solution. Instead, I suggest looking at the logs. Check whether Google or any other search engine has crawled your site. I bet you'll see delays between requests, which may point to a reasonable amount of time to sleep. I'm guessing it's about 1-5 seconds of sleep between fetches.
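
    As a sketch of what that log check could look like, something along these lines pulls the gaps between Googlebot hits out of an Apache combined log; the log path and the regexes are guesses here, not anything from the thread:

    use strict;
    use warnings;
    use Time::Local qw(timegm);

    my %mon = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
               Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

    my ($prev, @gaps);
    open my $log, '<', '/var/log/apache2/access.log' or die "open: $!";
    while (my $line = <$log>) {
        next unless $line =~ /Googlebot/;
        # timestamps look like [13/Jul/2008:18:25:43 +0000]; the zone offset is
        # ignored, which is fine for measuring gaps within a single log
        next unless $line =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})};
        my $t = timegm($6, $5, $4, $1, $mon{$2}, $3);
        push @gaps, $t - $prev if defined $prev;
        $prev = $t;
    }
    close $log;

    if (@gaps) {
        my $sum = 0;
        $sum += $_ for @gaps;
        printf "average gap between Googlebot requests: %.1f seconds\n", $sum / @gaps;
    }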

    I'm betting you're simply hitting an automatic web-host DoS countermeasure: self-managed iptables rules that simply drop incoming packets from apparent DoS attackers. Whoever is hosting the journalist's site is blocking you directly. There are likely a dozen (or a hundred) other sites you're also temporarily blocked from, but you won't notice those ;-)