in reply to Re^6: Async DNS with LWP
in thread Async DNS with LWP
I either need to use Perl's bloated thread model and directly use Perl's UDP and TCP interface or
If you can afford to pay for sufficient bandwidth to allow for serious web-crawling, then affording a box with sufficient memory to start enough threads to saturate that bandwidth will be the least of your concerns.
I have 4GB of RAM and I can run hundreds, even thousands of threads without getting anywhere near running out of memory. So the "bloat" of the ithreads model is neither here nor there.
Personally, I'd forget about asynchronous DNS. I'd stick an LWP::Parallel::UserAgent instance in one thread per core, and watch them totally saturate my bandwidth. No matter how fat a pipe I can afford.
This trivial demo is running 100 threads, each with a parallel user agent, on this box as I type, in just 1/2 GB of memory:
```perl
#! perl -slw
use strict;
use threads ( stack_size => 4096 );
use Thread::Queue;
use LWP::Parallel;

sub worker {
    my $tid = threads->tid;
    my( $Qin, $Qout ) = @_;
    my $ua = LWP::Parallel::UserAgent->new;

    print "Thread: $tid ready to go";

    while( defined( my $url = $Qin->dequeue ) ) {
        print $url;
    }
}

our $T //= 4;

my( $Qin, $Qout ) = map Thread::Queue->new(), 1 .. 2;

my @workers = map async( \&worker, $Qin, $Qout ), 1 .. $T;

sleep 100; ## Read your urls and feed the Q here.
```
Not that running that many threads on my 4 cores would be an effective strategy. But even if you're running on one of IBM's $250,000, 256-core, 1024-thread monsters, affording 5GB of memory so that you can run one parallel user agent on each core is the least of your worries, and the 25 lines of code above will scale to it. AS-IS.
And that's what you get with threads. Simplicity and scalability.
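For what it's worth, here is a rough sketch of how the "feed the Q" step could look, assuming the usual Thread::Queue idiom of one undef per worker as a stop signal. The names `fetch_stub` and `crawl` are made up for illustration, and the fetch itself is stubbed out so the sketch runs without touching the network. A real worker would hand each url to its user agent instead.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real fetch. It only returns the
# length of the url so the sketch needs no network access.
sub fetch_stub {
    my( $url ) = @_;
    return length $url;
}

sub worker {
    my( $Qin, $Qout ) = @_;
    my $tid = threads->tid;

    # dequeue() blocks until an item arrives; undef means "stop".
    while( defined( my $url = $Qin->dequeue ) ) {
        $Qout->enqueue( join "\t", $tid, $url, fetch_stub( $url ) );
    }
}

# Starts $T workers, feeds them @urls, waits, and returns the results.
sub crawl {
    my( $T, @urls ) = @_;
    my( $Qin, $Qout ) = map Thread::Queue->new(), 1 .. 2;
    my @workers = map threads->create( \&worker, $Qin, $Qout ), 1 .. $T;

    $Qin->enqueue( @urls );
    $Qin->enqueue( ( undef ) x $T );    # one stop signal per worker
    $_->join for @workers;

    # All workers have finished, so a non-blocking drain is enough.
    my @results;
    while( defined( my $r = $Qout->dequeue_nb ) ) {
        push @results, $r;
    }
    return @results;
}

print "$_\n" for crawl( 4, map "http://example.com/page$_", 1 .. 10 );
```

The order of the results depends on thread scheduling, so don't rely on it. In a long-running crawler you would drain `$Qout` while the workers are still busy rather than after joining them.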
But that bit is easy.
The complicated part of a high throughput webcrawler is not saturating the bandwidth. The complicated parts are:
And that means indexing (digesting) the content, not just the urls, because the same content can hide behind many different urls.
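As a minimal sketch of what indexing by content rather than by url might look like, assuming a plain in-memory hash keyed on a SHA-1 digest of the raw page bytes: the `seen_before` name, the `%seen` hash and the example.com pages are invented for illustration, and a real crawler would want a persistent store shared between threads.

```perl
use strict;
use warnings;
use Digest::SHA qw( sha1_hex );

my %seen;    # content digest => first url that produced that content

# Returns the url that first produced this content, or undef if the
# content is new. Expects raw bytes, not decoded characters.
sub seen_before {
    my( $url, $content ) = @_;
    my $digest = sha1_hex( $content );
    return $seen{ $digest } if exists $seen{ $digest };
    $seen{ $digest } = $url;
    return;
}

# Made-up pages: the first two urls serve identical content.
my @pages = (
    [ 'http://example.com/',           '<html>home</html>'  ],
    [ 'http://example.com/index.html', '<html>home</html>'  ],
    [ 'http://example.com/about',      '<html>about</html>' ],
);

for my $page ( @pages ) {
    my( $url, $content ) = @$page;
    my $first = seen_before( $url, $content );
    print defined $first ? "$url duplicates $first\n" : "$url is new\n";
}
```

An exact digest only catches byte-identical pages. Content that differs by a timestamp or a session id would need some normalisation before digesting, which is part of why this side of the problem is harder than it looks.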
Yes. Saturating your bandwidth is trivial; it is the rest that is hard. Worrying about asynchronous DNS at this point is premature and pointless.