in reply to How to download html with threads?
Threaded programming is not easy, and while parallel LWP may be what you want, this is a good example for introducing the concept if you would like to learn threading in general.
If you look at the steps your program goes through, they list like this:

1. Read URLs from a file
2. Download the content at each URL
3. Parse the content for the data you want
4. Write the parsed data out to a file

This maps naturally onto a combination of the pipeline and work crew threading models.
Pipeline? Work crew? What are those? you ask. The work crew model of threading creates multiple threads that all do the same thing on different pieces of data, allowing you to leverage multiprocessing on your system to get through the work faster. Be warned, though: the overhead of creating a ton of threads can outweigh this benefit. As a rule of thumb, the most work crew threads you want per job is the number of cores you have, plus one.
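The work crew idea can be sketched in a few lines. This is a toy example, not the OP's downloader: the "job" here is just squaring numbers, and the crew size and queue names are made up for illustration.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs      = Thread::Queue->new;
my $results   = Thread::Queue->new;
my $crew_size = 4;    # roughly cores + 1 on a quad-core box

# Every crew member runs the *same* loop, each on different items.
my @crew = map {
    threads->create(sub {
        while (defined(my $n = $jobs->dequeue)) {
            $results->enqueue($n * $n);
        }
    });
} 1 .. $crew_size;

$jobs->enqueue($_) for 1 .. 10;
$jobs->enqueue(undef) for 1 .. $crew_size;   # one terminator per worker
$_->join for @crew;

my $sum = 0;
$sum += $results->dequeue while $results->pending;
print "sum of squares: $sum\n";   # 385
```

Note the one-undef-per-worker trick at the end: each worker consumes exactly one terminator, so every thread gets the message to shut down.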
The pipeline threading model creates separate threads for separate tasks that are normally run sequentially but need to be run over large amounts of data, so that each task can feed the next. This again can leverage the multiple cores on a system: if 4 threads are running a 4-part task, you are essentially running 4 of the tasks in parallel. And if one part can run faster (say, the ingest part), it doesn't have to wait for the others; it can finish, freeing the system to do other things while its data waits in the queue.
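Here is a minimal two-stage pipeline sketch. The stages and data are invented for illustration (stage one "ingests" by doubling numbers, stage two formats them); the point is that both stages run concurrently, connected by a queue.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $raw    = Thread::Queue->new;   # feeds stage one
my $cooked = Thread::Queue->new;   # stage one feeds stage two
my $done   = Thread::Queue->new;   # stage two's output

# Stage one: "ingest" - double each number as it arrives.
my $stage1 = threads->create(sub {
    while (defined(my $n = $raw->dequeue)) {
        $cooked->enqueue($n * 2);
    }
    $cooked->enqueue(undef);       # tell the next stage we're done
});

# Stage two: "format" - runs at the same time as stage one.
my $stage2 = threads->create(sub {
    while (defined(my $n = $cooked->dequeue)) {
        $done->enqueue("value: $n");
    }
});

$raw->enqueue($_) for 1 .. 3;
$raw->enqueue(undef);              # end-of-data marker
$_->join for $stage1, $stage2;

my @out;
push @out, $done->dequeue_nb while $done->pending;
print "$_\n" for @out;             # value: 2, value: 4, value: 6
```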
These processing models sound great, but how does data move through the pipeline? There are many hard, complex, wizardly answers to this question, but perl makes things easy and provides Thread::Queue for dealing with it.
Thread::Queue provides a thread-safe queue for passing data between threads. Its two main methods are enqueue and dequeue, just like the non-thread-safe queue from any basic CS class.
To extensively thread the code you provided, three Thread::Queue objects are required (#XXX: Has anyone ever thought of a Thread::Stack object...?): one for sending URLs from the file reader to the downloaders, one for the downloaders to send their content to the parsers, and one for the parsers to send their parsed data to the file writer.
The threads module facilitates the creation and management of threads. Creating threads is very easy to do with perl: simply pass threads->create a code reference and some arguments for the subroutine, and it will be up and running in its own thread.
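For example (the sub and its arguments here are invented; note the backslash in \&greet, which passes a reference to the sub, where a bare &greet would call it immediately instead):

```perl
use strict;
use warnings;
use threads;

sub greet {
    my ($name, $times) = @_;
    return join ' ', ("hello $name") x $times;
}

# Pass a code ref plus arguments; get back a thread object.
my $thr = threads->create(\&greet, 'world', 2);

# join() waits for the thread and hands back its return value.
my $greeting = $thr->join;
print "$greeting\n";   # hello world hello world
```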
So, you have threads, you have data structures, you have a model. What to do what to do what to do? Well stitch it all together!
```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;
use HTTP::Request;

sub ReadURLS {
    my ($queue, $filename) = @_;
    # Use three-arg open for security reasons, die on errors so we
    # don't spew nonsense or crash worse later.
    open my $urlFile, '<', $filename or die "ReadURLS: bad file!: $!";
    chomp(my @urls = <$urlFile>);   # strip newlines or the requests will be malformed
    close $urlFile;
    # Place each line into the queue, followed by undef to signal the
    # end of data.
    $queue->enqueue(@urls, undef);
    return 1;   # Success! Return true. Or, if you're a unixy person,
                # return 0, or maybe even "0 but true".
}

# $inqueue should be the queue object passed to ReadURLS
# $outqueue should be the queue object passed to ParseContent
sub DownloadContent {
    my ($outqueue, $inqueue) = @_;
    # Each thread needs its own UA
    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
    $ua->timeout(15);
    # Wait for data, and stop when undef comes down the pipe (that
    # means there's no more).
    while (defined(my $url = $inqueue->dequeue)) {
        # this part should look familiar
        print "Downloading: $url\n";
        my $req      = HTTP::Request->new(GET => $url);
        my $response = $ua->request($req);
        # this changes: send the output to the next task handler
        $outqueue->enqueue($response->content());
    }
    $inqueue->enqueue(undef);    # pass the terminator on so sibling workers also stop
    $outqueue->enqueue(undef);   # and signal one parser downstream
    return 1;                    # See above return
}

# $inqueue should be the outqueue from the downloader sub
# $outqueue should be passed to the output sub
# $regex is, of course, your regex. This allows for re-use of the code.
# You could also consider taking some parsing rules and using an HTML
# parser of some type...
sub ParseContent {
    my ($outqueue, $inqueue, $regex) = @_;
    while (defined(my $content = $inqueue->dequeue)) {
        $outqueue->enqueue(join '', $content =~ m/$regex/m, "\n");
    }
    $outqueue->enqueue(undef);   # one terminator per parser reaches the writer
    return 1;
}

# $queue should be the outqueue passed to ParseContent
# $n_producers is how many undef terminators to expect (one per parser)
sub WriteOut {
    my ($queue, $filename, $n_producers) = @_;
    $n_producers = 1 unless defined $n_producers;
    open my $outFH, '>>', $filename or die "WriteOut: open failed: $!";
    while ($n_producers) {
        my $data = $queue->dequeue;
        if (!defined $data) { $n_producers--; next; }
        print $outFH $data;
    }
    close $outFH;
    return 1;
}
```

Note that with several downloaders sharing one input queue, a single undef terminator is not enough: only one worker would see it and the rest would block forever. Each downloader therefore puts the terminator back for its siblings before signalling downstream, and the writer counts one terminator per parser before it closes the file.
```perl
my $nr_workers = 5;   # the number of side-by-side downloaders and
                      # parsers. Better yet, take it as an argument.
my $urlfile = "url_planets.txt";    # see comment about arguments
my $outfile = "planet_names.txt";   # arguments are nice here too, but
                                    # not the current point

my $URLQueue     = Thread::Queue->new;
my $ContentQueue = Thread::Queue->new;
my $ParsedQueue  = Thread::Queue->new;

my @threadObjs;
# Create the reading thread and store a reference to it in @threadObjs;
# this will be important later. Note \&ReadURLS: a code reference, not
# a call.
push @threadObjs, threads->create(\&ReadURLS, $URLQueue, $urlfile);

# Set up the workers; any number of them can manipulate the queues.
for (1 .. $nr_workers) {
    push @threadObjs, threads->create(\&DownloadContent,
                                      $ContentQueue, $URLQueue);
    push @threadObjs, threads->create(\&ParseContent,
                                      $ParsedQueue, $ContentQueue,
                                      qr!Rotations<i>(.*)</i>!);
}
push @threadObjs, threads->create(\&WriteOut,
                                  $ParsedQueue, $outfile, $nr_workers);

# Now that all the threads are created, the main thread should call
# join on all of its child thread objects, to ask perl to clean up
# after them and so it doesn't exit before they're done, causing an
# abrupt termination.
foreach my $thr (@threadObjs) {
    $thr->join();   # join can have a return value, but checking it
                    # adds overhead; do so only if you really need to
}

# At this point, barring some horrible catastrophe, the specified
# $outfile should have the desired output.
```
If you're looking for a simpler answer, BrowserUk's response will do just fine.
Re^2: How to download html with threads?
by BrowserUk (Patriarch) on Jul 31, 2007 at 01:02 UTC
by Trizor (Pilgrim) on Jul 31, 2007 at 06:26 UTC