What do you think about the TCP based bandwidth reducing strategies I proposed?

Not much. By the time you have sent out 90e6 ACKs and processed the replies, if any, the state that information represents is out of date. If you averaged 1 per millisecond (and you couldn't), it's going to take 25 hours. By that time, machines that responded have gone down, and machines that were down are back up. Besides, aren't most TCP stacks these days programmed to ignore random ACKs? Utterly pointless.
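The 25-hour figure is just arithmetic, and a quick check looks something like this. The function name is made up for illustration, and the 1 ms per probe is the (optimistic) rate assumed above, with probes sent strictly one after another:

```python
def probe_hours(n_hosts: int, seconds_per_probe: float) -> float:
    """Hours needed to probe n_hosts sequentially at a fixed rate."""
    return n_hosts * seconds_per_probe / 3600.0

# 90e6 hosts at an assumed 1 probe per millisecond.
hours = probe_hours(90_000_000, 0.001)
print(f"{hours:.1f} hours")
```

Any slower rate per probe scales the total linearly, so the real figure would only be worse.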

The same problem applies to doing connects. Just because a server did/did not respond now doesn't mean it will/won't next time.

As for doing HEADs: you have all the same overheads and latencies in performing a HEAD as you do a GET. The server has to prepare the whole page in order to work out how big it is. Whether you then fetch, say, 2k or the full 100k is almost insignificant. You're just doubling the work at both your end and theirs.

Finally, fetching all the smaller pages first is a really bad idea. Smaller pages represent a higher proportion of overhead to actual data retrieved. You might max out your bandwidth, though you would probably have to increase the number of threads, but less of it would be doing actual useful work.
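To put rough numbers on the overhead point, here is a minimal sketch. The 1500 bytes of fixed per-request overhead (headers, handshake and so on) is an assumed figure picked purely for illustration, not a measurement, and `useful_fraction` is a hypothetical helper:

```python
# Assumed fixed cost per request, in bytes. Illustrative only.
OVERHEAD_BYTES = 1500

def useful_fraction(payload_bytes: int, overhead_bytes: int = OVERHEAD_BYTES) -> float:
    """Share of the bytes on the wire that are actual page content."""
    return payload_bytes / (payload_bytes + overhead_bytes)

small = useful_fraction(2 * 1024)     # a 2k page
large = useful_fraction(100 * 1024)   # a 100k page
print(f"2k page:   {small:.0%} of traffic is useful data")
print(f"100k page: {large:.0%} of traffic is useful data")
```

Whatever the true overhead is, the smaller the page, the larger the share of your pipe it eats.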

Your best overall throughput will come from processing the files randomly. A random mix of large and small files, fast and slow servers, randomly distributed (topologically and geographically), will give the best chance of there being something available to utilise your bandwidth at any given point in time. As soon as you start to try to 'manage' the output of 90 million unknown servers, any heuristic you might think to apply is far more likely to reduce your throughput than enhance it.
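As a rough sketch of what processing the files randomly means in practice: shuffle the picklist once, then let a fixed pool of workers drain it. The `fetch` function here is a hypothetical stand-in that does no network I/O, and the URLs and worker count are made up for the example:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> int:
    """Hypothetical stand-in for a real HTTP GET; returns a fake byte count."""
    return len(url)

def crawl_in_random_order(urls, workers=32, seed=None):
    """Shuffle the picklist, then fetch it with a fixed-size thread pool."""
    picklist = list(urls)
    random.Random(seed).shuffle(picklist)  # no ordering heuristic at all
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(fetch, picklist))
    return picklist, sizes

urls = [f"http://host{i}.example/page" for i in range(100)]
order, sizes = crawl_in_random_order(urls, workers=32, seed=1)
print(len(order), "URLs fetched,", sum(sizes), "bytes (fake)")
```

The only tuning knob is `workers`; the ordering itself is left to chance.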

It's simple statistics. Assuming a "random" ordering of the picklist, if at any given moment you have 32 or 64 concurrent fetches, the odds of them all being really slow servers, or really large files, or currently down for maintenance (or all fast, small, instantly available, etc.) are very small. So, on average, you are going to get an average throughput, and you choose the number of threads to tune that average close to the saturation level of your pipe.
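To see how small those odds are, here is a back-of-envelope sketch. It assumes each concurrent fetch is an independent draw from the picklist (which is what a random ordering buys you), and the figure of 20% of servers being "really slow" is a made-up number for illustration:

```python
def all_slow_probability(p_slow: float, concurrent: int) -> float:
    """Chance that every one of `concurrent` independent fetches hits a slow server."""
    return p_slow ** concurrent

# Assumed: 1 in 5 servers is "really slow".
for n in (32, 64):
    print(f"{n} fetches all slow: {all_slow_probability(0.2, n):.3g}")
```

The same reasoning applies to "all large", "all down", or "all fast"; the independence assumption is exactly what any ordering heuristic throws away.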

But the moment you start to try and order the pick list according to some set of artificial criteria--artificial because you cannot really know the current state of truth about 90 million remote servers--you're just guessing. Even if you could measure the instantaneous effect of a given heuristic, it might show an improvement now, but then be worse for the rest of your 250 hours of work.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

In reply to Re^21: Async DNS with LWP by BrowserUk
in thread Async DNS with LWP by jc
