Hi,
I'm trying to write a script that downloads data from a weblog service. There are over 4,000 blogs there. I have already written the parser, and I'm using HTTP::Lite to perform the GET requests. However, when I start the script, it fails after a little over 200 requests. I analyzed the traffic with Wireshark: I see a lot of duplicate ACKs and retransmissions, and then my client sends FIN/ACK and closes the connection.
Update: it's important to add that I can't just rerun the script. If I do, the server doesn't respond at all: my client sends 6 TCP SYN segments and gets no response.
My guess is that the server or a proxy is blocking me because of too many requests. I perform the GETs in a loop, so there are no parallel requests. I suppose I need to write some "sleeping" code (that adapts to changing network performance and, most importantly, to the server's limits) or something like that, but before I do, I'd like to ask what you think about it. Besides, there are over 4,000 blogs, each with around 20-30 posts, so if I used sleep(10) between requests, downloading the whole site would take something like 300-400 hours, and I need the results in about 3 days :)
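To make the "sleeping" idea concrete, here is a minimal sketch of what I have in mind, not tested against the real server. The names next_delay and fetch_with_backoff, the factors 2 and 0.9, and the 1-300 second limits are all my own assumptions; the server's actual limits are unknown to me. The fetch itself is passed in as a code reference so the pacing logic can be tried without touching the network.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: compute the next pause (in seconds) from the current
# pause and the outcome of the last request. Failures double the pause,
# successes shrink it by 10%. The limits are guesses, not known server limits.
sub next_delay {
    my ($delay, $ok, %opt) = @_;
    my $min  = $opt{min} // 1;      # never go faster than this
    my $max  = $opt{max} // 300;    # never wait longer than this
    my $next = $ok ? $delay * 0.9 : $delay * 2;
    $next = $min if $next < $min;
    $next = $max if $next > $max;
    return $next;
}

# Hypothetical wrapper: pause, call $fetch->($url), adapt the pause, and
# retry on failure. $fetch must return the page body, or undef on failure.
# $state carries 'delay' across calls so the pace adapts over the whole run.
sub fetch_with_backoff {
    my ($fetch, $url, $state) = @_;
    my $sleep = $state->{sleep} || sub { Time::HiRes::sleep($_[0]) };
    for my $try (1 .. ($state->{max_tries} // 5)) {
        $sleep->($state->{delay});              # pause before every request
        my $body = $fetch->($url);
        my $ok   = defined $body;
        $state->{delay} = next_delay($state->{delay}, $ok);
        return $body if $ok;
    }
    return;                                     # gave up after max_tries
}

# Possible use with HTTP::Lite (assumes $ua is an HTTP::Lite object):
#   my $state = { delay => 2, max_tries => 5 };
#   my $body  = fetch_with_backoff(sub {
#       my $status = $ua->request($_[0]);
#       my $page   = (defined $status && $status == 200) ? $ua->body() : undef;
#       $ua->reset();
#       return $page;
#   }, $url, $state);
```

For a rough budget: if the site really needs about 100,000 requests (4,000 blogs times roughly 25 posts, assuming one request per post), 3 days is 259,200 seconds, which allows about 2.6 seconds per request on average. Whether the server tolerates that pace is exactly what I don't know.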
sub get_blog{
#..
#..
#..some code
# the header name goes in the first argument only; the value must not repeat it
$ua->add_req_header("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9) Gecko/2008061015 Firefox/3.0");
$ua->request($_[0]) or die "unable to get ".$_[0];
my $content = $ua->body();
$ua->reset();
# if this is the first page of the blog -> parse the blog's constant elements
# parse posts and comments
# recursion -> find the address of the next page and fetch it
}
There is no additional error handling for the GET requests.
Thanks