in reply to Re^18: Async DNS with LWP
in thread Async DNS with LWP

Any thoughts?

Yes. You are fixated on asynchronous DNS.

If you sent out concurrent DNS requests for 90 million URLs, your DNS server/provider would almost certainly blacklist you instantly.

Synchronous DNS is never a factor in throughput after the first second of runtime, because any time one thread spends waiting for DNS, one or more of the others will be utilising the processor and bandwidth downloading.
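To see why, here's a minimal sketch (not production code; the fractional sleeps are hypothetical stand-ins for real DNS waits and downloads) of how a pool of worker threads makes one thread's wait disappear behind the others' work:

```perl
#!/usr/bin/perl
# Sketch only: simulated "fetches" (sleeps) stand in for LWP calls,
# to show that per-item waits overlap once you have a pool of workers.
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw( time sleep );

my $Q = Thread::Queue->new( 1 .. 8 );   # 8 work items
$Q->enqueue( (undef) x 4 );             # one end-of-queue marker per worker

my $start = time;
my @workers = map {
    threads->create( sub {
        while( defined( my $item = $Q->dequeue ) ) {
            sleep 0.25;                 # stands in for DNS wait + download
        }
    } );
} 1 .. 4;
$_->join for @workers;

my $elapsed = time - $start;
printf "elapsed: %.2f seconds\n", $elapsed;
```

With 4 workers and 8 quarter-second "fetches", the wall-clock time is roughly 0.5s rather than the 2s a single thread would take; real LWP workers overlap their DNS waits the same way.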

You're approaching the whole problem the wrong way. You're trying to optimise things before you actually have any idea of where the bottlenecks are.

Replies are listed 'Best First'.
Re^20: Async DNS with LWP
by jc (Acolyte) on Oct 09, 2010 at 21:13 UTC
    Fair comment. Let's forget about DNS for now. What do you think about the TCP-based bandwidth-reducing strategies I proposed?
      What do you think about the TCP-based bandwidth-reducing strategies I proposed?

      Not much. By the time you have sent out 90e6 ACKs and processed the replies, if any, the state that information represents is out of date. If you averaged 1 per millisecond (and you couldn't), it's going to take 25 hours. By that time, machines that responded will have gone down, and machines that were down will be back up. Besides, aren't most TCP stacks these days programmed to ignore random ACKs? Utterly pointless.
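      The 25-hour figure is just arithmetic:

```perl
use strict;
use warnings;

my $probes = 90e6;    # one probe per URL
my $rate   = 1000;    # optimistic: 1 probe per millisecond
my $hours  = $probes / $rate / 3600;

printf "%.0f hours just to probe each server once\n", $hours;
```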

      The same problem applies to doing connects. Just because a server did or did not respond now doesn't mean it will or won't next time.

      As for doing HEADs: you have all the same overheads and latencies in performing a HEAD as you do a GET. The server has to prepare the whole page in order to work out how big it is. Whether you then fetch, say, 2k or the full 100k is almost insignificant. You're just doubling the work at both your end and theirs.
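      A back-of-envelope cost model (the numbers are invented purely for illustration) shows why HEAD-then-GET is a doubling, not a saving:

```perl
use strict;
use warnings;

# Assumed per-request costs, for illustration only:
my $overhead_s = 0.5;    # DNS lookup + TCP handshake + server building the page
my $body_s     = 0.1;    # transferring an average body

my $get_only      = $overhead_s + $body_s;        # just GET it
my $head_then_get = 2 * $overhead_s + $body_s;    # probe with HEAD first

printf "GET only: %.1fs   HEAD then GET: %.1fs\n", $get_only, $head_then_get;
```

Whatever numbers you plug in, the overhead term is paid twice and only the (small) body term is ever saved.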

      Finally, fetching all the smaller pages first is a really bad idea. Smaller pages represent a higher proportion of overhead to actual data retrieved. You might max out your bandwidth (though you would probably have to increase the number of threads to do so), but less of it would be doing actual useful work.

      Your best overall throughput will come from processing the files randomly. A random mix of large and small files, fast and slow servers, randomly distributed (topologically and geographically), will give the best chance of there being something available to utilise your bandwidth at any given point in time. As soon as you start to try to 'manage' the output of 90 million unknown servers, any heuristic you might think to apply is far more likely to reduce your throughput than enhance it.
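      Randomising the picklist is a one-liner in Perl. A sketch, with a hypothetical 10-URL list standing in for the 90 million:

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# Hypothetical picklist; in practice you'd read your 90e6 URLs from disk.
my @urls = map { "http://host$_.example/page" } 1 .. 10;

# Randomising the order mixes fast/slow, large/small, up/down servers --
# which is all the "scheduling" this problem needs.
my @picklist = shuffle @urls;

print scalar @picklist, " urls queued in random order\n";
```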

      It's simple statistics. If at any given moment you have 32 or 64 concurrent fetches, the odds of them all being really slow servers, or really large files, or currently down for maintenance (or all fast, small, and instantly available) are very small, assuming a "random" ordering of the picklist. So, on average, you are going to get average throughput, and you choose the number of threads to tune that average close to the saturation level of your pipe.
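      The statistics can be sketched directly. Assuming (purely for illustration) that 10% of servers are "really slow" at any instant and the picklist is random:

```perl
use strict;
use warnings;

my $p_slow     = 0.1;    # assumed fraction of slow servers at any instant
my $concurrent = 64;     # concurrent fetches in flight

# Probability that *every* in-flight fetch hit a slow server at once:
my $p_all_slow = $p_slow ** $concurrent;

printf "P(all %d fetches slow at once) = %g\n", $concurrent, $p_all_slow;
```

Even with a much gloomier estimate of the slow fraction, the probability of all 64 stalling simultaneously is negligible, which is why the average stays close to the average.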

      But the moment you start to try and order the picklist according to some set of artificial criteria--artificial because you cannot really know the current state of truth about 90 million remote servers--you're just guessing. Even if you could measure the instantaneous effect of a given heuristic, it might show an improvement now, but then be worse for the rest of your 250 hours of work.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.