in reply to Re^17: Async DNS with LWP
in thread Async DNS with LWP

I see what you mean. In fact, I was toying with the idea of the following architecture to maximise throughput:

* Set up asynchronous DNS to quickly resolve all 90,000,000 domains and find out which domains reside at the same IP (shared hosting); a rough sketch of this step follows the list.

* Send out a TCP ACK to each web server (asynchronously) to get a shortlist of which domain names actually have a responding web server (a live stack should send a RST in response to an unexpected ACK).

* Then send out TCP connects with a short timeout to shortlist the servers which respond quickly.

* To those that respond fast enough, send out HEAD requests to obtain document sizes.

* Asynchronously GET the smallest documents first, so that the database of links grows as fast as possible. Any thoughts?
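
For reference, a rough sketch of how I picture the first (DNS grouping) step, using AnyEvent::DNS as one possible module; the in-flight cap and the input/output handling are just placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::DNS;

    chomp( my @domains = <STDIN> );   # one domain per line
    my %by_ip;                        # ip => [ domains hosted there ]
    my $MAX      = 500;               # in-flight cap: tune for your resolver
    my $inflight = 0;
    my $cv       = AE::cv;

    my $kick; $kick = sub {
        while ( $inflight < $MAX and @domains ) {
            my $dom = shift @domains;
            $inflight++;
            $cv->begin;
            AnyEvent::DNS::a $dom, sub {            # @_ holds the A records
                push @{ $by_ip{$_} }, $dom for @_;
                $inflight--;
                $kick->();                          # refill the window
                $cv->end;
            };
        }
    };
    $cv->begin; $kick->(); $cv->end;                # guards the empty-list case
    $cv->recv;

    printf "%-15s %d domains\n", $_, scalar @{ $by_ip{$_} } for keys %by_ip;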

Re^19: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 09, 2010 at 20:06 UTC
    Any thoughts?

    Yes. You are fixated on asynchronous DNS.

    If you sent out concurrent DNS requests for 90 million urls, your DNS server/provider would almost certainly blacklist you instantly.

    Synchronous DNS is never a factor in throughput after the first second of runtime, because any time one thread spends waiting for DNS, one or more others will be utilising the processor and bandwidth for downloading.
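
    As a purely illustrative sketch of that overlap, assuming ithreads, Thread::Queue and plain blocking LWP (the thread count and timeout are placeholders):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::UserAgent;

        my $THREADS = 32;                        # tune until the pipe is saturated
        my $q = Thread::Queue->new( <STDIN> );   # one URL per line
        $q->enqueue( (undef) x $THREADS );       # one "no more work" marker per worker

        my @workers = map {
            threads->create( sub {
                my $ua = LWP::UserAgent->new( timeout => 30 );
                while ( defined( my $url = $q->dequeue ) ) {
                    chomp $url;
                    # The blocking DNS lookup inside get() stalls only this
                    # thread; the other workers carry on downloading meanwhile.
                    my $res = $ua->get( $url );
                    # ... extract links from $res->decoded_content and store them ...
                }
            } );
        } 1 .. $THREADS;

        $_->join for @workers;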

    You're approaching the whole problem the wrong way. You're trying to optimise things before you actually have any idea of where the bottlenecks are.

      Fair comment. Let's forget about DNS for now. What do you think about the TCP-based bandwidth-reducing strategies I proposed?
        What do you think about the TCP-based bandwidth-reducing strategies I proposed?

        Not much. By the time you had sent out 90e6 ACKs and processed the replies, if any, the state that information represents would be out of date. If you averaged 1 per millisecond (and you couldn't), it's going to take 25 hours. By that time, machines that responded will have gone down and machines that were down will be back up. Besides, aren't most TCP stacks these days programmed to ignore random ACKs? Utterly pointless.

        The same problem applies to doing connects. Just because a server did/did not respond now doesn't mean it will/won't next time.

        As for doing HEADs: you have all the same overheads and latencies in performing a HEAD as you do a GET. The server has to prepare the whole page in order to work out how big it is. Whether you then fetch, say, 2k or the full 100k is almost insignificant. You're just doubling the work at both your end and theirs.
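
        If the worry is pulling down huge pages, a single size-capped GET avoids paying for both a HEAD and a GET; a minimal LWP::UserAgent sketch (the URL and the cap are placeholders):

            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new( timeout => 30 );
            $ua->max_size( 100 * 1024 );   # stop reading the body after ~100k

            my $res = $ua->get( 'http://example.com/' );
            if ( $res->header( 'Client-Aborted' ) ) {
                # The body was truncated at max_size; the first ~100k is
                # still available in $res->content.
            }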

        Finally, fetching all the smaller pages first is a really bad idea. Smaller pages represent a higher proportion of overhead to actual data retrieved. You might max out your bandwidth (though you would probably have to increase the number of threads), but less of it would be doing actual useful work.

        Your best overall throughput will come from processing the files randomly. A random mix of large and small files, fast and slow servers, randomly distributed (topologically and geographically), will give the best chance of there being something available to utilise your bandwidth at any given point in time. As soon as you start trying to 'manage' the output of 90 million unknown servers, any heuristic you might think to apply is far more likely to reduce your throughput than enhance it.

        It's simple statistics. If at any given moment you have 32 or 64 concurrent fetches, the odds of them all being really slow servers, really large files, or currently down for maintenance (or all fast, small, instantly available, etc.) are very small, assuming a "random" ordering of the picklist. So, on average, you are going to get average throughput, and you choose the number of threads to tune that average close to the saturation level of your pipe.
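
        In code terms, that "random" ordering amounts to nothing more than shuffling the picklist before handing it to the workers; a minimal sketch using List::Util (the worker count is a placeholder):

            use strict;
            use warnings;
            use List::Util 'shuffle';

            chomp( my @urls = <STDIN> );

            # Destroy any accidental ordering (alphabetical, by TLD, by dump
            # order) so the in-flight mix of fast/slow and large/small stays
            # random at every instant.
            my @picklist = shuffle @urls;

            # Then feed @picklist to however many workers it takes to keep
            # the pipe full; start low and raise the count until bandwidth,
            # not latency, is the limit.
            my $WORKERS = 64;   # placeholder: tune against your own pipe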

        But the moment you start trying to order the picklist according to some set of artificial criteria (artificial because you cannot really know the current state of truth about 90 million remote servers), you're just guessing. Even if you could measure the instantaneous effect of a given heuristic, it might show an improvement now, but then be worse for the rest of your 250 hours of work.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^19: Async DNS with LWP
by roboticus (Chancellor) on Oct 10, 2010 at 17:07 UTC

    jc:

    To amplify BrowserUk's point: too many people want to create an "optimal" solution right out of the gate. But computer and network behaviour is so complicated that, without measurements, you can't determine *what* to optimize, nor which behaviour you may need to fix.

    Remember:

    • First ... make it work!
    • Next ... make it work right!
    • Finally ... make it work right now!

    So first just try coding the simplest thing you can that works. After you've made it work correctly, is it fast/good enough? If so you're done--and with far less work!

    Only if it's not fast/good enough do you need to make any improvements. So, to improve it, first figure out what needs improvement: if you just guess, you're likely to be wrong and you'll waste your time, so measure it. Look at your measurement results to see where you can get the most improvement, make that improvement, and check whether you're done. If not, take more measurements, choose the next chunk of code, and so on.
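
    Even something as crude as per-phase wall-clock timing (or a profiler such as Devel::NYTProf) will show where the time actually goes; a rough Time::HiRes sketch, with illustrative phase names:

        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );

        my %spent;                      # phase name => seconds accumulated
        sub timed {
            my ( $phase, $code ) = @_;
            my $t0 = [ gettimeofday ];
            my @r  = $code->();
            $spent{$phase} += tv_interval( $t0 );
            return @r;
        }

        # In the fetch loop (resolve/extract and the variables are illustrative):
        # my ($addr)  = timed( dns   => sub { resolve( $host ) } );
        # my ($res)   = timed( fetch => sub { $ua->get( $url ) } );
        # my (@links) = timed( parse => sub { extract( $res )  } );

        END { printf "%-8s %10.2fs\n", $_, $spent{$_} for sort keys %spent }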

    How do you know when you're done? If at all possible, choose a performance goal. Once you meet it, you're done. Sometimes you'll find that you must accept worse performance than you planned (if you can't improve the performance enough), or you'll have to investigate better algorithms, faster hard drives, more memory, etc.

    ...roboticus

      Hi. This is in fact what I have done. First I made a simple serial version based on LWP. This averaged a domain per second (2.8 years for 90,000,000 domains). Then I used Parallel::ForkManager to throw more processes at the job, hoping for an improvement; by tweaking factors like the number of processes and the length of the timeout I was only able to reach an average of about 10 domains per second. Then I tried AnyEvent::HTTP and was able to average about 100 per second. I'm trying to improve on this.
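
      For what it's worth, the AnyEvent::HTTP version is roughly this shape (the in-flight cap, per-host limit and timeout below are placeholders rather than my real settings):

          use strict;
          use warnings;
          use AnyEvent;
          use AnyEvent::HTTP;

          $AnyEvent::HTTP::MAX_PER_HOST = 4;    # per-host politeness

          chomp( my @urls = <STDIN> );          # one URL per line
          my $MAX      = 200;                   # total in-flight cap
          my $inflight = 0;
          my $cv       = AE::cv;

          my $kick; $kick = sub {
              while ( $inflight < $MAX and @urls ) {
                  my $url = shift @urls;
                  $inflight++;
                  $cv->begin;
                  http_get $url, timeout => 30, sub {
                      my ( $body, $hdr ) = @_;
                      # ... extract and store links from $body here ...
                      $inflight--;
                      $kick->();                # refill the window
                      $cv->end;
                  };
              }
          };
          $cv->begin; $kick->(); $cv->end;      # guards the empty-list case
          $cv->recv;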