in reply to Re^11: Async DNS with LWP
in thread Async DNS with LWP

Yes! Definitely! Ideally I'd want as much bandwidth as possible, and lots and lots of memory. I'm negotiating with my university, but that is unlikely to get anywhere fast. It looks like I'm going to have to fork out for a dedicated server myself.

Seeing that I'm concentrating on .com domains at the moment, the geographical location and connectivity of the box could be important. I was thinking of something like http://www.m5hosting.com/ValueNet.php These guys also seem to be the only ones I can find offering decent OpenBSD dedicated servers. With 1,500 GB/month transfer and a maximum burstable speed of 100Mbps at no additional charge, it looks like as good a deal as I'm likely to find. According to http://www.m5hosting.com/network.php they seem to be pretty well connected too, especially for US-based traffic. I've also used these guys before and their service was pretty good: they fix problems fast and don't charge for the service.

As it stands, though, I'm connected to the internet through a USB modem via a mobile phone operator that offers variable bandwidth (depending on where you are) and is limited to 100 hours per month. In any case, the time I can spend crawling the net is always ultimately going to be limited by bandwidth and how well I use it. Asynchronous DNS and HTTP seem to be the fundamental issues.

I'm almost at the point of thinking: why didn't I just code this in C from the very beginning? Years ago I wrote an asynchronous DNS resolver in C (I no longer have the code) and, yes, it took much longer to implement, but as far as I can remember it ran as fast as the 100Mbps burstable link could take. In fact, it ran so well that we had to redirect it to a better DNS server (a cluster of load-balanced DNS servers) so that they could keep up with the requests. I've even wondered whether it's worth writing my own recursive resolver to see whether that process can be optimised.

Maybe this is overkill, but I want this to work in hours (maybe days), certainly not years.
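For what it's worth, the resolver pattern described above — fire off many queries, then collect replies as they trickle in — can be sketched in pure Perl with only core modules, by building the query packet by hand and multiplexing UDP sockets with IO::Select. This is a minimal sketch, not production code; the answer-section parsing is deliberately stubbed out, and the nameserver address and timeout are illustrative assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Build a minimal DNS query packet (QTYPE A, QCLASS IN) by hand.
sub build_dns_query {
    my ( $name, $id ) = @_;
    my $header = pack 'n6', $id, 0x0100, 1, 0, 0, 0;   # RD set, 1 question
    my $qname  = join '', map { pack( 'C', length $_ ) . $_ } split /\./, $name;
    $qname .= "\0";                                    # root label terminator
    return $header . $qname . pack 'n2', 1, 1;         # QTYPE=A, QCLASS=IN
}

# Send all queries at once, then collect replies as they arrive.
sub resolve_async {
    my ( $server, $timeout, @names ) = @_;
    my $sel = IO::Select->new;
    my ( %pending, %answer );
    my $id = 0;
    for my $name (@names) {
        my $sock = IO::Socket::INET->new(
            PeerAddr => $server, PeerPort => 53, Proto => 'udp',
        ) or next;
        $sock->send( build_dns_query( $name, ++$id ) );
        $sel->add($sock);
        $pending{$sock} = $name;
    }
    my $deadline = time + $timeout;
    while ( $sel->count and time < $deadline ) {
        for my $sock ( $sel->can_read(1) ) {
            my $buf;
            $sock->recv( $buf, 512 );
            # Real code would parse the answer section (including name
            # compression) here; we just record that a reply came back.
            $answer{ $pending{$sock} } = length $buf;
            $sel->remove($sock);
            close $sock;
        }
    }
    return \%answer;
}
```

Parsing the reply (compression pointers and all) is where most of the work in a real resolver goes; Net::DNS's bgsend/bgread interface does the same socket juggling for you if you'd rather not hand-roll it.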

Replies are listed 'Best First'.
Re^13: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 08, 2010 at 11:10 UTC

    I'm sorry, but worrying about async DNS at this point is ... well, pointless. Let's do some math.

    With your current setup, 90e6 sites at, say, an average of 100k per home page(*) comes to ~9 TB. To download that lot in your 100-hour allocation, you'd need to be fetching constantly at a rate of 25 MBytes/s, which would (conservatively) require a 250Mbps connection. To do it in your target 3 hours you'd need an 8 Gbps connection.
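The arithmetic above is easy to check (the figures are from the post; the gap between the raw bit rate and the "conservative" line speed allows for protocol overhead and less-than-perfect utilisation):

```perl
use strict;
use warnings;

my $sites     = 90e6;                 # pages to fetch
my $avg_bytes = 100e3;                # assumed average home page size
my $total     = $sites * $avg_bytes;  # 9e12 bytes, i.e. ~9 TB

my $rate_100h = $total / ( 100 * 3600 );    # bytes/sec over 100 hours
printf "100-hour budget: %.0f MBytes/s\n", $rate_100h / 1e6;    # 25 MBytes/s

my $rate_3h = $total / ( 3 * 3600 );        # bytes/sec over 3 hours
printf "3-hour target: %.1f Gbps raw\n", $rate_3h * 8 / 1e9;    # ~6.7 Gbps

# And at a sustained 100 Mbps (12.5 MBytes/s), the same 9 TB takes:
my $hours_100mbps = $total / 12.5e6 / 3600;
printf "At 100 Mbps: %.0f hours\n", $hours_100mbps;             # 200 hours
```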

    Now, I'm not sure what data rates you can achieve with GSM (GPRS/EGPRS) in the US, but I'm pretty sure they'll be measured in 10s of Kbps. Not Mbps, much less Gbps.

    Even once you moved to your hoster, if you could sustain their 100Mbps burst rate indefinitely, 90e6 * 100k would take ~250 hours to download. And they'd cut you off long before that.

    Worrying about shaving a few milliseconds here and there using asynchronous DNS is just a drop in the ocean.

    (*)They seem to range from the minimalist google at 8k, up to the commercial bloat of sky.com at 250k; but 100k is a good average of the few I looked at.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You raise some good points. What kind of throughput did you manage to achieve with your solution? What bandwidth was available? I'm guessing you never saturated it, right?

        We had an early 4-cpu (real cpus, not cores) SMP box with a (shared) 1Gbps link direct to the (a) backbone. We easily saturated that with 32 threads running bog-standard LWP & Digest::MD5--provided we didn't store the data to disk. Even with RAIDed (RAID-5, I think) disks, the bottleneck was storing what we could read. That was circa 7 years ago.
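The shape of that setup — N worker threads each running a stock user agent and checksumming as they go — is compact in Perl. A minimal sketch using only core modules (threads, Thread::Queue, Digest::MD5, and HTTP::Tiny standing in for the LWP of the original; the thread count and timeout are illustrative):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use HTTP::Tiny;                  # core since 5.14; the original used LWP
use Digest::MD5 qw(md5_hex);

my $WORKERS = 32;
my $queue   = Thread::Queue->new;

sub worker {
    my $ua = HTTP::Tiny->new( timeout => 10 );
    while ( defined( my $url = $queue->dequeue ) ) {
        my $res = $ua->get($url);
        next unless $res->{success};
        # Checksum in memory -- writing content to disk was the bottleneck.
        printf "%s %s %d bytes\n", md5_hex( $res->{content} ), $url,
            length $res->{content};
    }
}

my @threads = map { threads->create( \&worker ) } 1 .. $WORKERS;
$queue->enqueue(@ARGV);                    # URLs from the command line
$queue->enqueue( (undef) x $WORKERS );     # one poison pill per worker
$_->join for @threads;
```

The per-thread user agent avoids sharing connection state across threads; the undef "poison pills" let every worker drain the queue and exit cleanly.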

        To do the job properly at the scale you're talking about, you'd need to run a distributed crawler: each node with dedicated, high-speed, RAIDed local drives--or hugely expensive SSD arrays--and a distributed queueing mechanism.

