Okay. Here are a few of the causes of your slowdown:
Inserting a brief pause (select undef, undef, undef, 0.1;) into your main thread loop drops the overall CPU usage from 99% to 3-4%.
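For reference, four-argument select with undefined bit-vectors is the classic Perl idiom for a sub-second sleep without needing Time::HiRes::sleep. A minimal sketch (Time::HiRes is used here only to measure the pause):

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # only to time the pause, not to perform it

my $t0 = time();
# select with three undef filehandle sets simply blocks for the timeout
select undef, undef, undef, 0.1;
my $elapsed = time() - $t0;
printf "paused for %.3f seconds\n", $elapsed;
```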
The simple expedient of moving the creation of the LWP::UserAgent object outside the loop, and re-using the same user agent for each request made by the thread, speeds the processing and reduces the memory consumption/thrashing enormously.
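The pattern is simply to hoist the constructor out of the request loop. A sketch of the idea (MyAgent is a hypothetical stand-in for LWP::UserAgent, so the snippet runs without network access; the URLs are placeholders):

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical stand-in for LWP::UserAgent -- no network needed
package MyAgent { sub new { bless {}, shift } }

my $ua = MyAgent->new();    # constructed ONCE, before the loop
my %agents_seen;
for my $url ( qw( http://example.com/a http://example.com/b ) ) {
    # $ua->request( HEAD $url ) would go here; the same object serves every URL
    $agents_seen{ refaddr($ua) } = 1;
}
print scalar( keys %agents_seen ), " distinct agent(s) used\n";
```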
I think this was the main cause of your slowdown.
If the load imposed by running 20 threads uses all your CPU, adding another 80 into the mix will not help; your code will just spend more time swapping and less time processing.
The trick is always: start with a few threads, check that you aren't leaking memory or thrashing the CPU to death, and then increase the number of threads until adding more doesn't result in any greater throughput.
With those few changes, I managed to process 2270 HEAD requests in 97 seconds. And that is with my 40kb/s dial-up connection--using just 10* threads!
* Update: Once I made the number of threads a command-line parameter, I found that I get no discernible increase in throughput once I move above 10 threads. Even though 10 threads use barely 10% of my CPU, the limit on throughput seems to be solely the limited bandwidth of my connection. If you have a faster connection, you may be able to increase throughput by using more threads, but don't go mad: start with 10 and increase in small jumps.
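Making the thread count a command-line parameter is a one-line change near the top of the script. A sketch, defaulting to 10 per the update above (the usage message wording is my own):

```perl
use strict;
use warnings;

# Take the thread count from the command line; default to 10, a safe starting point
my $max_thread = @ARGV ? shift @ARGV : 10;
die "usage: $0 [nthreads]\n"
    unless $max_thread =~ /^\d+$/ and $max_thread >= 1;
print "max_thread=$max_thread\n";
```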
[16:57:57.93] P:\test>418095 >nul Queued urls: 2270 Time:97 Done
By my calculations, that means I should see a throughput of around 84,000 urls an hour--which I think well exceeds your requirements. With a little optimisation, this could probably be sped up considerably.
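The arithmetic behind that estimate, from the 2270 urls in 97 seconds measured above:

```perl
use strict;
use warnings;

my ( $urls, $secs ) = ( 2270, 97 );
my $per_hour = int( $urls / $secs * 3600 );    # scale the measured rate to an hour
printf "%d urls/hour\n", $per_hour;
```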
Note: This was done on Win32--I have no feel for what sort of results you will get under Linux.
I would be most grateful to hear what sort of throughput you get, and with how many threads, on your system.
Here's the version of your code I used to get the above results:
#!/usr/bin/perl
use strict;
use threads;
use threads::shared;
use LWP::UserAgent;
use HTTP::Request::Common;
use Thread::Queue;

$| = 1;

my $thread_num : shared = 0;
my $max_thread = 25;
my $exit       = 0;
my $dump       = 0;
my $start_time = 0;
my %tid : shared = ();

my $task_q   = Thread::Queue->new();
my $result_q = Thread::Queue->new();

my @urls = <DATA>;
chomp @urls;
$task_q->enqueue( @urls );
undef @urls;
warn "Queued urls: ", $task_q->pending, "\n";

$start_time = time();
threads->new( \&thread_do ) for 1 .. $max_thread;

while ( $task_q->pending ) {
    select undef, undef, undef, 0.1;    ## Brief pause; stops this loop burning CPU
    print $result_q->dequeue(), "\n" while $result_q->pending();
    if ( $dump++ > 100000 ) {
        # print "Dump\n";
        dump_tid( \%tid );
        $dump = 0;
    }
}

sleep 3;    ## Give the task threads time to finish up
warn "\n\nTime:" . ( time() - $start_time ) . "\n";
warn "Done\n";

sub thread_do {
    threads->self->detach();
    my $tid = threads->self->tid();
    my $ua  = LWP::UserAgent->new( timeout => 3 );    ## One agent per thread, reused for every request
    while ( $task_q->pending ) {
        my $url = $task_q->dequeue;
        my $res = $ua->request( HEAD $url );
        $result_q->enqueue( "$tid; $url ::= " . $res->code() . ";" );
        lock %tid;
        $tid{ $tid }++;
    }
}

sub dump_tid {
    # my $tid = shift;
    # open( DUMP, "> dump.txt" );
    # print DUMP "$_ = $tid->{$_}\n" foreach keys %$tid;
    # close DUMP;
}
In reply to Re: Problem with ithreads
by BrowserUk
in thread Problem with ithreads
by 2NetFly