in reply to Re^9: Async DNS with LWP
in thread Async DNS with LWP

As it stands, I'm developing this on a single core. The computer crashing hasn't been an issue for me, and in order to minimise repetition of work the state of the crawl is saved to disk each time memory fills up, so proportionally not that much work would actually be repeated should my little box ever decide to crash. I do take your point, though, and will experiment further with writing to disk on the fly. I'm not sure what sort of optimisations you would propose to make writing quicker. As far as I know, the only general way to make writing to disk quicker is to write as much data as possible in one go, and to make those writes to consecutive space (not really possible for a hash table).

Anyway, I'm not interested in duplicate content because I don't process the content at all. The goal is to create a map of links on the internet. Whether a number of different roads lead to the same location does not concern me at this point; what concerns me is to exhaustively map those roads. So that brings us back to my real problem at present: making the best use of the available bandwidth.
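For what it's worth, the sort of batched writing I have in mind would look something like this. It's only a minimal sketch: the file name, the tab-separated record format and the flush threshold are all invented, but the idea is to turn many tiny writes into a few large sequential appends.

    use strict;
    use warnings;

    # Hypothetical sketch: buffer link records in memory and append them
    # to a log in large sequential chunks, rather than doing one small
    # write per discovered link.
    my @buffer;
    my $FLUSH_AT = 50_000;               # records per flush; tune to taste

    sub record_link {
        my ( $from, $to ) = @_;
        push @buffer, "$from\t$to\n";
        flush_links() if @buffer >= $FLUSH_AT;
    }

    sub flush_links {
        return unless @buffer;
        open my $fh, '>>', 'linkmap.log' or die "linkmap.log: $!";
        print {$fh} @buffer;             # one big sequential append
        close $fh or die "close linkmap.log: $!";
        @buffer = ();
    }

    record_link( 'http://example.com/', 'http://example.org/about' );
    flush_links();                       # flush whatever is left at the end

The trade-off is that whatever is still sitting in the buffer is lost on a crash, which seems acceptable here since a little repeated work is tolerable.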

Re^11: Async DNS with LWP
by BrowserUk (Patriarch) on Oct 07, 2010 at 23:36 UTC

    Are you intending to move this to a beefier box at some point in the future? If so, what spec of box and what bandwidth will it have?

    If not, what bandwidth do you have on the current box?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes! Definitely! Ideally I would want as fat a pipe as possible, with lots and lots of memory. I'm negotiating with my University, but that is unlikely to get anywhere fast, so it looks like I'm going to have to fork out for a dedicated server myself. Seeing that I am concentrating on .com domains at the moment, the geographical location and connectivity of the box could be important. I was thinking of something along the lines of http://www.m5hosting.com/ValueNet.php (these guys also seem to be the only ones I can find with decent OpenBSD dedicated servers). With 1,500 GB/month of transfer and a maximum burstable speed of 100Mbps at no additional charge, it looks like as good a deal as I'm likely to find. According to http://www.m5hosting.com/network.php they seem to be pretty well connected too, especially for US-based traffic. I've also used them before and their service was pretty good: they fix problems fast and don't charge for doing so.

      However, as it stands I'm connected to the internet through a USB modem via a mobile phone operator that offers variable bandwidth (depending on where you are) and is limited to only 100 hours per month. In any case, the time the crawl takes is always going to be ultimately limited by bandwidth and how well I use it. Asynchronous DNS and HTTP seem to be the fundamental issues. I'm almost at the point of asking myself: why didn't I just code this in C from the very beginning? Years ago I wrote an asynchronous DNS resolver in C (I no longer have the code) and, yes, it took much longer to implement, but as far as I can remember it was as fast as the 100Mbps burstable connection could carry. In fact, it ran so well that we had to redirect it to a better DNS server (a cluster of load-balanced DNS servers) so that they could keep up with the requests. I've even wondered whether it would be worth writing my own recursive resolver to see if the process can be optimised further. Maybe that is overkill, but I want this to run in hours (maybe days), certainly not in years.
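      Before dropping all the way to C, it might be worth seeing how far Net::DNS's background queries get me from Perl. A rough sketch of the kind of thing I mean is below; the host names are just placeholders and the five-second timeout is arbitrary, so treat it as a starting point rather than anything tuned.

          use strict;
          use warnings;
          use Net::DNS;
          use IO::Select;

          my @hosts = qw( example.com perl.org cpan.org );

          my $res = Net::DNS::Resolver->new;
          my $sel = IO::Select->new;
          my %pending;                                  # socket => host name

          # Fire off every query without waiting for an answer.
          for my $host (@hosts) {
              my $sock = $res->bgsend( $host, 'A' );
              $sel->add($sock);
              $pending{$sock} = $host;
          }

          # Collect the answers in whatever order they come back.
          while ( %pending and $sel->count ) {
              my @ready = $sel->can_read(5) or last;    # give up after 5s of silence
              for my $sock (@ready) {
                  my $packet = $res->bgread($sock);
                  my $host   = delete $pending{$sock};
                  $sel->remove($sock);
                  next unless $packet;
                  for my $rr ( $packet->answer ) {
                      print "$host => ", $rr->address, "\n" if $rr->type eq 'A';
                  }
              }
          }

      The point is simply that the lookups overlap instead of blocking one after another; whether that gets anywhere near what a hand-rolled C resolver managed is something I'd have to measure.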

        I'm sorry, but worrying about async DNS at this point is ... well, pointless. Let's do some math.

        With your current setup: 90e6 sites at, say, an average of 100k per home page(*) comes to roughly 9e12 bytes, about 9 TB. To download that lot in your 100-hour allocation, you'd need to be fetching constantly at a rate of 25 Mbytes/s, which would (conservatively) require a 250Mbps connection. To do it in your target 3 hours, you'd need an 8 Gbps connection.

        Now, I'm not sure what data rates you can achieve with GSM (GPRS/EGPRS) in the US, but I'm pretty sure they'll be measured in tens of Kbps: not Mbps, much less Gbps.

        Even once you move to your hoster, if you could sustain their 100Mbps burst rate indefinitely, 90e6 * 100k would take ~250 hours to download. And they'd cut you off long before that.

        Worrying about shaving a few milliseconds here and there using asynchronous DNS is just a drop in the ocean.

        (*) They seem to range from the minimalist google at 8k up to the commercial bloat of sky.com at 250k, but 100k is a good average of the few I looked at.
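        For anyone who wants to check the arithmetic, here's a quick back-of-the-envelope in Perl. It just reproduces the raw figures; the numbers above are rounded up conservatively.

            use strict;
            use warnings;

            my $sites = 90e6;                    # pages to fetch
            my $page  = 100e3;                   # average bytes per home page
            my $total = $sites * $page;          # ~9e12 bytes, i.e. ~9 TB

            # Sustained rate needed to fetch the lot in a given number of hours.
            for my $hours ( 100, 3 ) {
                my $Bps = $total / ( $hours * 3600 );
                printf "%3d hours -> %6.1f MB/s (~%.0f Mbps on the wire)\n",
                    $hours, $Bps / 1e6, $Bps * 8 / 1e6;
            }

            # And the other way round: how long at a sustained 100 Mbps?
            my $rate = 100e6 / 8;                # 12.5 MB/s
            printf "at 100 Mbps: ~%.0f hours (before any protocol overhead)\n",
                $total / $rate / 3600;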


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.