in reply to Re^19: Async DNS with LWP
in thread Async DNS with LWP
Re^21: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 10, 2010 at 08:50 UTC
"What do you think about the TCP based bandwidth reducing strategies I proposed?"

Not much. By the time you've sent out 90e6 acks and processed the replies, if any, the state that information represents is out of date. Even if you averaged 1 per millisecond (and you couldn't), it's going to take 25 hours. By that time, machines that responded have gone down, and machines that were down are back up. Besides, aren't most TCP stacks these days programmed to ignore random acks? Utterly pointless.

The same problem applies to doing connects. Just because a server did or did not respond now doesn't mean it will or won't next time.

As for doing HEADs: you have all the same overheads and latencies in performing a HEAD as you do a GET. The server has to prepare the whole page in order to work out how big it is. Whether you then fetch say 2k or the full 100k is almost insignificant. You're just doubling the work at both your end and theirs.

Finally, fetching all the smaller pages first is a really bad idea. Smaller pages represent a higher proportion of overhead to actual data retrieved. You might max out your bandwidth, though you would probably have to increase the number of threads, but less of it is doing actual useful work.

Your best overall throughput will come from processing the files randomly. A random mix of large and small files, fast and slow servers, randomly distributed (topographically and geographically), gives the best chance of there being something available to utilise your bandwidth at any given point in time. As soon as you start to try to 'manage' the output of 90 million unknown servers, any heuristic you might think to apply is far more likely to reduce your throughput than enhance it.

It's simple statistics. If at any given moment in time you have 32 or 64 concurrent fetches, the odds of them all being really slow servers, or really large files, or currently down for maintenance; or all fast, small, instantly available etc.; are very small--assuming a "random" ordering of the picklist. So, on average, you are going to get an average throughput. And you choose the number of threads to tune that average close to the saturation level of your pipe.

But the moment you start to try to order the picklist according to some set of artificial criteria--artificial because you cannot really know the current state of truth about 90 million remote servers--you're just guessing. Even if you could measure the instantaneous effect of a given heuristic, it might show an improvement now, but then be worse for the rest of your 250 hours of work.
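A minimal sketch of that "shuffle the picklist, feed a fixed pool of threads" shape, using threads, Thread::Queue and LWP::UserAgent. The thread count, timeout and output are illustrative assumptions, not anything from the thread; for a 90e6-entry list you would also feed the queue in batches rather than loading it all at once.

```perl
#!/usr/bin/perl
# Sketch only: a fixed pool of worker threads pulling URLs from a
# randomly shuffled picklist read from stdin or a file on the command line.
use strict;
use warnings;
use threads;
use Thread::Queue;
use List::Util qw( shuffle );
use LWP::UserAgent;

my $THREADS = 64;                 # assumed; tune towards pipe saturation
my $Q = Thread::Queue->new;

sub worker {
    my $ua = LWP::UserAgent->new( timeout => 30 );    # assumed timeout
    # Each worker simply takes the next URL off the queue, whatever it is.
    while( defined( my $url = $Q->dequeue ) ) {
        my $resp = $ua->get( $url );
        printf "%s %s\n", $resp->code, $url;           # stand-in for real processing
    }
}

my @workers = map { threads->create( \&worker ) } 1 .. $THREADS;

# Shuffle the whole picklist so slow servers and big files are spread
# evenly across the run rather than bunched together.
chomp( my @urls = <> );
$Q->enqueue( shuffle @urls );

# One undef per worker as an end-of-work marker, then wait for them all.
$Q->enqueue( ( undef ) x $THREADS );
$_->join for @workers;
```

With the list shuffled up front, whatever a worker dequeues next is effectively a random draw, so at any instant the in-flight requests are an average mix of fast and slow, large and small, and the aggregate throughput stays close to whatever the thread count allows.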
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.