in reply to Re^17: Async DNS with LWP
in thread Async DNS with LWP

I see what you mean. In fact, I was toying with the idea of the following architecture to maximise throughput:

* Set up asynchronous DNS to quickly resolve all 90,000,000 domains and find out which domains reside at the same IP (shared hosting); a rough sketch of this step follows the list.

* Send out a TCP ACK to each web server (asynchronously) to get a shortlist of which domain names actually have a responding web server (a live stack should send a RST in response to an unexpected ACK).

* Then send out TCP connects with a short timeout to shortlist the servers which respond quickly.

* To those that respond fast enough, send out HEAD requests to obtain document sizes.

* Asynchronously GET the smallest documents first, so that the database of links grows as fast as possible. Any thoughts?
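
For reference, a rough sketch of how I picture the first (DNS grouping) step, using AnyEvent::DNS as one possible module; the in-flight cap and the input/output handling are just placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::DNS;

    chomp( my @domains = <STDIN> );   # one domain per line
    my %by_ip;                        # ip => [ domains hosted there ]
    my $MAX      = 500;               # in-flight cap: tune for your resolver
    my $inflight = 0;
    my $cv       = AE::cv;

    my $kick; $kick = sub {
        while ( $inflight < $MAX and @domains ) {
            my $dom = shift @domains;
            $inflight++;
            $cv->begin;
            AnyEvent::DNS::a $dom, sub {            # @_ holds the A records
                push @{ $by_ip{$_} }, $dom for @_;
                $inflight--;
                $kick->();                          # refill the window
                $cv->end;
            };
        }
    };
    $cv->begin; $kick->(); $cv->end;                # guards the empty-list case
    $cv->recv;

    printf "%-15s %d domains\n", $_, scalar @{ $by_ip{$_} } for keys %by_ip;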

Re^19: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 09, 2010 at 20:06 UTC
    Any thoughts?

    Yes. You are fixated on asynchronous DNS.

    If you sent out concurrent DNS requests for 90 million urls, your DNS server/provider would almost certainly blacklist you instantly.

    Synchronous DNS is never a factor in throughput after the first second of runtime, because any time one thread spends waiting for DNS, one or more others will be utilising the processor and bandwidth for downloading.
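
    As a purely illustrative sketch of that overlap, assuming ithreads, Thread::Queue and plain blocking LWP (the thread count and timeout are placeholders):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::UserAgent;

        my $THREADS = 32;                        # tune until the pipe is saturated
        my $q = Thread::Queue->new( <STDIN> );   # one URL per line
        $q->enqueue( (undef) x $THREADS );       # one "no more work" marker per worker

        my @workers = map {
            threads->create( sub {
                my $ua = LWP::UserAgent->new( timeout => 30 );
                while ( defined( my $url = $q->dequeue ) ) {
                    chomp $url;
                    # The blocking DNS lookup inside get() stalls only this
                    # thread; the other workers carry on downloading meanwhile.
                    my $res = $ua->get( $url );
                    # ... extract links from $res->decoded_content and store them ...
                }
            } );
        } 1 .. $THREADS;

        $_->join for @workers;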

    You're approaching the whole problem the wrong way. You're trying to optimise things before you actually have any idea of where the bottlenecks are.

      Fair comment. Let's forget about DNS for now. What do you think about the TCP-based bandwidth-reducing strategies I proposed?
        What do you think about the TCP-based bandwidth-reducing strategies I proposed?

        Not much. By the time you had sent out 90e6 ACKs and processed the replies, if any, the state that information represents would be out of date. If you averaged 1 per millisecond (and you couldn't), it's going to take 25 hours. By that time, machines that responded will have gone down and machines that were down will be back up. Besides, aren't most TCP stacks these days programmed to ignore random ACKs? Utterly pointless.

        The same problem applies to doing connects. Just because a server did/did not respond now doesn't mean it will/won't next time.

        As for doing HEADs: you have all the same overheads and latencies in performing a HEAD as you do a GET. The server has to prepare the whole page in order to work out how big it is. Whether you then fetch, say, 2k or the full 100k is almost insignificant. You're just doubling the work at both your end and theirs.
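
        If the worry is pulling down huge pages, a single size-capped GET avoids paying for both a HEAD and a GET; a minimal LWP::UserAgent sketch (the URL and the cap are placeholders):

            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new( timeout => 30 );
            $ua->max_size( 100 * 1024 );   # stop reading the body after ~100k

            my $res = $ua->get( 'http://example.com/' );
            if ( $res->header( 'Client-Aborted' ) ) {
                # The body was truncated at max_size; the first ~100k is
                # still available in $res->content.
            }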

        Finally, fetching all the smaller pages first is a really bad idea. Smaller pages represent a higher proportion of overhead to actual data retrieved. You might max out your bandwidth (though you would probably have to increase the number of threads), but less of it would be doing actual useful work.

        Your best overall throughput will come from processing the files randomly. A random mix of large and small files, fast and slow servers, randomly distributed (topologically and geographically), will give the best chance of there being something available to utilise your bandwidth at any given point in time. As soon as you start trying to 'manage' the output of 90 million unknown servers, any heuristic you might think to apply is far more likely to reduce your throughput than enhance it.

        It's simple statistics. If at any given moment you have 32 or 64 concurrent fetches, the odds of them all being really slow servers, really large files, or currently down for maintenance (or all fast, small, instantly available, etc.) are very small, assuming a "random" ordering of the picklist. So, on average, you are going to get average throughput, and you choose the number of threads to tune that average close to the saturation level of your pipe.
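
        In code terms, that "random" ordering amounts to nothing more than shuffling the picklist before handing it to the workers; a minimal sketch using List::Util (the worker count is a placeholder):

            use strict;
            use warnings;
            use List::Util 'shuffle';

            chomp( my @urls = <STDIN> );

            # Destroy any accidental ordering (alphabetical, by TLD, by dump
            # order) so the in-flight mix of fast/slow and large/small stays
            # random at every instant.
            my @picklist = shuffle @urls;

            # Then feed @picklist to however many workers it takes to keep
            # the pipe full; start low and raise the count until bandwidth,
            # not latency, is the limit.
            my $WORKERS = 64;   # placeholder: tune against your own pipe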

        But the moment you start trying to order the picklist according to some set of artificial criteria (artificial because you cannot really know the current state of truth about 90 million remote servers), you're just guessing. Even if you could measure the instantaneous effect of a given heuristic, it might show an improvement now, but then be worse for the rest of your 250 hours of work.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^19: Async DNS with LWP
by roboticus (Chancellor) on Oct 10, 2010 at 17:07 UTC

    jc:

    To amplify BrowserUk's point: too many people want to create an "optimal" solution right out of the gate. But computer and network behaviour is so complicated that, without measurements, you can't determine *what* to optimize, nor which behaviour you may need to fix.

    Remember:

    • First ... make it work!
    • Next ... make it work right!
    • Finally ... make it work right now!

    So first just try coding the simplest thing you can that works. After you've made it work correctly, is it fast/good enough? If so you're done--and with far less work!

    Only if it's not fast/good enough do you need to make any improvements. So, to improve it, first figure out what needs improvement: if you just guess, you're likely to be wrong and you'll waste your time, so measure it. Look at your measurement results to see where you can get the most improvement, make that improvement, and check whether you're done. If not, take more measurements, choose the next chunk of code, and so on.
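
    Even something as crude as per-phase wall-clock timing (or a profiler such as Devel::NYTProf) will show where the time actually goes; a rough Time::HiRes sketch, with illustrative phase names:

        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );

        my %spent;                      # phase name => seconds accumulated
        sub timed {
            my ( $phase, $code ) = @_;
            my $t0 = [ gettimeofday ];
            my @r  = $code->();
            $spent{$phase} += tv_interval( $t0 );
            return @r;
        }

        # In the fetch loop (resolve/extract and the variables are illustrative):
        # my ($addr)  = timed( dns   => sub { resolve( $host ) } );
        # my ($res)   = timed( fetch => sub { $ua->get( $url ) } );
        # my (@links) = timed( parse => sub { extract( $res )  } );

        END { printf "%-8s %10.2fs\n", $_, $spent{$_} for sort keys %spent }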

    How do you know when you're done? If at all possible, choose a performance goal. Once you meet it, you're done. Sometimes you'll find that you must accept worse performance than you planned (if you can't improve the performance enough), or you'll have to investigate better algorithms, faster hard drives, more memory, etc.

    ...roboticus

      Hi. This is in fact what I have done. First I made a simple serial version based on LWP. This averaged a domain per second (2.8 years for 90,000,000 domains). Then I used Parallel::ForkManager to throw more processes at the job, hoping for an improvement; by tweaking factors like the number of processes and the length of the timeout I was only able to reach an average of about 10 domains per second. Then I tried AnyEvent::HTTP and was able to average about 100 per second. I'm trying to improve on this.
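
      For what it's worth, the AnyEvent::HTTP version is roughly this shape (the in-flight cap, per-host limit and timeout below are placeholders rather than my real settings):

          use strict;
          use warnings;
          use AnyEvent;
          use AnyEvent::HTTP;

          $AnyEvent::HTTP::MAX_PER_HOST = 4;    # per-host politeness

          chomp( my @urls = <STDIN> );          # one URL per line
          my $MAX      = 200;                   # total in-flight cap
          my $inflight = 0;
          my $cv       = AE::cv;

          my $kick; $kick = sub {
              while ( $inflight < $MAX and @urls ) {
                  my $url = shift @urls;
                  $inflight++;
                  $cv->begin;
                  http_get $url, timeout => 30, sub {
                      my ( $body, $hdr ) = @_;
                      # ... extract and store links from $body here ...
                      $inflight--;
                      $kick->();                # refill the window
                      $cv->end;
                  };
              }
          };
          $cv->begin; $kick->(); $cv->end;      # guards the empty-list case
          $cv->recv;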