in reply to Re^7: Async DNS with LWP
in thread Async DNS with LWP
Hi,
Actually, I'd already solved most of these problems, which is why I kept hammering on about saturating bandwidth. I have a list of 90,000,000 registered .com domains, and my breadth-first policy ensures that I'm not hitting the same server repeatedly. In fact, I'm spreading requests across so many domains that I doubt even the most rigorous analysis of any one server's logs would flag my visits as an intensive crawl.
I use the URI module to resolve all URIs to absolute form, so this is not a problem: no duplicate absolute URIs are added to my to-do list.
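A minimal sketch of that dedupe-on-absolute-form step, using the URI module's `new_abs` (the base URL and link strings below are hypothetical):

```perl
# Resolve relative links against the page they were found on, canonicalize,
# and only queue URLs we haven't seen in absolute form before.
use strict;
use warnings;
use URI;

my $base = 'http://example.com/dir/index.html';   # hypothetical page URL
my %seen;                                         # dedupe on absolute form
my @todo;

for my $link ('../about.html', '/contact', 'http://example.com/contact') {
    my $abs = URI->new_abs($link, $base)->canonical->as_string;
    push @todo, $abs unless $seen{$abs}++;
}

# '/contact' and the already-absolute 'http://example.com/contact'
# collapse to one entry, so only two URLs are queued.
print "$_\n" for @todo;
```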
My disk I/O policy is also pretty stable. I cache results in an in-memory hash table; when that table grows to its user-defined memory limit, its contents are used to update an on-disk version of the hash table via MLDBM. The theory behind this strategy is that the I/O bottleneck is reduced by lumping the writes into one big batch.
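A sketch of that batched write-back policy. The "disk" side here is a plain hash for illustration; in the real crawler it would be a hash tied through MLDBM (e.g. `tie my %disk, 'MLDBM', ...`), and the limit and function names are hypothetical:

```perl
use strict;
use warnings;

my $MAX_KEYS = 4;   # hypothetical stand-in for the user-defined memory limit
my %cache;          # in-memory results
my %disk;           # stand-in for the MLDBM-tied on-disk hash

sub record {
    my ($url, $links) = @_;
    $cache{$url} = $links;
    flush() if keys %cache >= $MAX_KEYS;   # one big batch of writes
}

sub flush {
    # Lump all pending writes into a single pass over the DBM file,
    # instead of paying the I/O cost once per page fetched.
    $disk{$_} = delete $cache{$_} for keys %cache;
}

record("http://example.com/$_", ["link$_"]) for 1 .. 5;
flush();   # write out whatever is left at shutdown
printf "disk entries: %d, cache entries: %d\n",
    scalar keys %disk, scalar keys %cache;
```

With a real MLDBM tie, each `$disk{...} = ...` store serializes the value to the DBM file, so batching them amortizes the per-write overhead.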
As it stands, I'm trying to build a list of links from the front pages of all registered .com domains, and I don't want to wait 3 years for the results (my current estimate for a serial LWP version). If I could cut that to 3 days, or 3 hours, I would be a very happy chappy.
Now, I've taken a quick look at your example code and notice that you aren't actually doing anything with LWP. You've consumed 1/2 GB with only 100 threads that are not performing any TCP communication; the moment you start doing TCP, the TCP/IP stack of whatever OS you are using will start consuming even more memory. (Note that the small netbook I am developing on has only about 1/2 GB to work with.)
Now, using an Async::HTTP event-driven model, I've managed to sustain an average of about 50 concurrent TCP connections on Windows 7. (I'm still not sure about the internal limits of the Windows 7 TCP stack or how to tweak them; the model has changed since XP and the old registry keys are no longer valid.) Note that I am consuming nowhere near the amount of memory you quoted for your 100 threads.
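The event-driven pattern I'm describing boils down to capping the number of in-flight requests and starting the next queued one from each completion callback. Here is a minimal sketch with the fetcher made pluggable; in real use you'd pass a wrapper around your async HTTP client's fetch call, and all names below are otherwise hypothetical:

```perl
use strict;
use warnings;

# Run at most $max fetches at once; $fetch->($url, $on_done) is the
# pluggable async fetcher, calling $on_done->($body) when finished.
sub run_pool {
    my ($urls, $fetch, $max) = @_;
    my @queue    = @$urls;
    my $inflight = 0;
    my @results;
    my $launch;
    $launch = sub {
        while ($inflight < $max && @queue) {
            my $url = shift @queue;
            $inflight++;
            $fetch->($url, sub {
                my ($body) = @_;
                push @results, [ $url, $body ];
                $inflight--;
                $launch->();    # a slot freed up: refill the pool
            });
        }
    };
    $launch->();
    return \@results;
}

# Demo with a synchronous stub fetcher standing in for a real HTTP call:
my $stub = sub { my ($url, $cb) = @_; $cb->(uc $url) };
my $res  = run_pool([ map "url$_", 1 .. 5 ], $stub, 2);
print scalar(@$res), " pages fetched\n";
```

The design point is that concurrency is bounded by the `$max` window rather than by spawning one thread per request, which is why the memory footprint stays flat as the to-do list grows.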
Now, I'm willing to accept that the bloat in your example probably comes from loading fat LWP instances rather than from the process-like thread model Perl uses, so I'll experiment with threads and sockets before giving up on Perl for this job.
In conclusion, I'm wondering at this point how difficult it would be to reimplement WWW::Mechanize as a thread-safe wrapper around optimized C code. My experiments so far suggest that this is what it would take to put Perl in the running for a simple interface with a high-performance back end. As it stands, LWP is fat and Mechanize is fatter still; the current trend with Perl web-crawling modules seems to be that the easier the interface, the fatter and slower the resulting crawler.
Replies are listed 'Best First'.

- Re^9: Async DNS with LWP by Corion (Patriarch) on Oct 07, 2010 at 08:53 UTC
- Re^9: Async DNS with LWP by BrowserUk (Patriarch) on Oct 07, 2010 at 09:56 UTC
  - by jc (Acolyte) on Oct 07, 2010 at 20:19 UTC
  - by BrowserUk (Patriarch) on Oct 07, 2010 at 23:36 UTC
  - by jc (Acolyte) on Oct 08, 2010 at 08:07 UTC
  - by BrowserUk (Patriarch) on Oct 08, 2010 at 11:10 UTC