Threaded programming is not easy, and while parallel LWP may be what you want, this problem is a good example for introducing the concept if you would like to learn threading in general.

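If you look at the steps your program goes through, they list like this:

  1. Read the URLs from the file
  2. Download the content for each URL
  3. Parse the content
  4. Write the parsed data to the output file

The download and parse steps can be parallelized but aren't inherently parallel, and each step can run in its own thread, combining the pipeline and work crew models of threading.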

Pipeline? Work crew? What are those, you ask? The work crew model of threading creates multiple threads that do the same thing on different bits of data, letting you leverage the multiple processors on your system to get things done faster. Be warned, however: past a certain point the overhead of creating a ton of threads outweighs this benefit. The most work crew threads you probably want per job is the number of cores you have plus one.
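As a rough illustration of the work crew idea, here is a minimal sketch, assuming a perl built with thread support. The square_worker sub, the queue names and the worker count of 3 are made up for the example; they are not part of your program.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $nr_workers = 3;                 # assumed value; tune to your core count
my $jobs    = Thread::Queue->new;   # work waiting to be done
my $results = Thread::Queue->new;   # finished work

# Hypothetical worker: every crew member runs this same sub on
# whatever item it happens to dequeue next.
sub square_worker {
    my ($in, $out) = @_;
    while (defined(my $n = $in->dequeue)) {
        $out->enqueue($n * $n);
    }
    return;
}

my @crew = map { threads->create(\&square_worker, $jobs, $results) } 1 .. $nr_workers;

# Queue the work, then one undef per worker so each one knows to stop.
$jobs->enqueue(1 .. 10, (undef) x $nr_workers);
$_->join for @crew;

# Workers finish in no fixed order, so sort before printing.
my @squares = sort { $a <=> $b } map { $results->dequeue } 1 .. 10;
print "@squares\n";
```

Note that the results come back in whatever order the workers finish, which is why the sketch sorts them at the end.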

The pipeline threading model creates separate threads for separate tasks that would typically run sequentially but need to run over large amounts of data, so that each task feeds the next. This too can leverage the multiple cores on a system: if 4 threads are running a 4-part task, you are essentially running all 4 parts in parallel. And if one part runs faster than the others (say, the ingest part), it doesn't have to wait; it can finish early, freeing the system to do other things while its data waits in the queue.
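A pipeline can be sketched in the same spirit. This is only an illustration, assuming a perl built with thread support; the three stages and the sample strings are invented for the example.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $raw     = Thread::Queue->new;   # stage 1 -> stage 2
my $trimmed = Thread::Queue->new;   # stage 2 -> stage 3

# Stage 1: produce the data, then send undef to mark the end.
my $producer = threads->create(sub {
    $raw->enqueue("  mercury ", " venus", "earth  ", undef);
});

# Stage 2: strip whitespace, forwarding each item as soon as it is ready.
my $trimmer = threads->create(sub {
    while (defined(my $item = $raw->dequeue)) {
        $item =~ s/^\s+|\s+$//g;
        $trimmed->enqueue($item);
    }
    $trimmed->enqueue(undef);       # pass the end marker down the pipe
});

# Stage 3 (the main thread): consume the finished items.
my @names;
while (defined(my $name = $trimmed->dequeue)) {
    push @names, uc $name;
}
$_->join for $producer, $trimmer;
print "@names\n";
```

With a single thread per stage the items keep their original order; once a stage becomes a work crew, that guarantee goes away.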

Inter-thread communication

These processing models sound great, but how does data move through the pipeline? There are many hard, complex, wizardly answers to this question, but Perl makes things easy and provides Thread::Queue for dealing with this.

Thread::Queue provides a thread-safe queue for passing data between threads. Its two main methods are enqueue and dequeue, which behave just like those of the non-thread-safe queue from any basic CS class.
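To make the two methods concrete, here is a small single-threaded sketch, assuming a perl built with thread support. The item names are arbitrary.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;

# enqueue appends items to the tail of the queue.
$queue->enqueue('first', 'second', 'third');

# pending reports how many items are currently waiting.
my $waiting = $queue->pending;

# dequeue removes items from the head, so the order is first in, first out.
# It blocks when the queue is empty, so only ask for what was put in.
my @taken = map { $queue->dequeue } 1 .. 3;

print "waiting: $waiting, taken: @taken\n";
```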

To extensively thread the code you provided, three Thread::Queue objects are required (#XXX: Has anyone ever thought of a Thread::Stack object...?): one for sending URLs from the file to the downloader, one for the downloader to send its content to the parser, and one for the parser to send its parsed data to the file writer.

So I've got my fancy data structures, how do I create threads??

The threads module facilitates the creation and management of threads. Creating threads is very easy in Perl: simply pass the threads->create method a sub ref and any arguments for the subroutine, and it will be up and running in its own thread.
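A minimal sketch of that call, assuming a perl built with thread support. The add_up sub is a made-up example, not something from your code.

```perl
use strict;
use warnings;
use threads;

# Hypothetical task: sum whatever arguments the thread was started with.
sub add_up {
    my @numbers = @_;
    my $total = 0;
    $total += $_ for @numbers;
    return $total;
}

# Pass a sub ref (note the backslash) followed by the sub's arguments.
my $thr = threads->create(\&add_up, 1, 2, 3);

# join blocks until the thread finishes and hands back its return value.
my $sum = $thr->join;
print "sum: $sum\n";
```

The backslash matters: a bare &add_up would call the sub in the current thread instead of passing a reference to it.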

Putting it all together

So, you have threads, you have data structures, you have a model. What to do, what to do, what to do? Well, stitch it all together!

  1. You need to refactor your code so that each task lives in its own subroutine that takes a queue or two as arguments and puts its useful values into the outgoing queue. The return value is now just an exit code.

         use strict;
         use warnings;
         use threads;
         use Thread::Queue;
         use LWP::UserAgent;
         use HTTP::Request;

         # $nr_consumers is the number of downloader threads reading from $queue.
         sub ReadURLS {
             my ($queue, $filename, $nr_consumers) = @_;
             # Use three-arg open for security reasons; die on errors so we
             # don't spew nonsense or crash worse later.
             open my $urlFile, '<', $filename or die "ReadURLS: bad file!: $!";
             chomp(my @urls = <$urlFile>);   # strip newlines so the URLs are clean
             close $urlFile;
             # Place each non-empty line into the queue, followed by one undef
             # per downloader to signal the end of data.
             $queue->enqueue((grep { length } @urls), (undef) x $nr_consumers);
             return 1; # Success! Return true. Or, if you're a unixy person,
                       # return 0 or maybe even "0 but true".
         }

         # $inqueue should be the queue object passed to ReadURLS.
         # $outqueue should be the queue object passed to ParseContent.
         sub DownloadContent {
             my ($outqueue, $inqueue) = @_;
             # Each thread needs its own UA.
             my $ua = LWP::UserAgent->new;
             $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
             $ua->timeout(15);
             # Wait for data and stop when undef comes down the pipe
             # (that means there's no more).
             while (defined(my $url = $inqueue->dequeue)) {
                 # This part should look familiar.
                 print "Downloading: $url\n";
                 my $req = HTTP::Request->new(GET => $url);
                 my $response = $ua->request($req);
                 # This changes: send the output to the next task handler.
                 $outqueue->enqueue($response->content());
             }
             $outqueue->enqueue(undef); # tell one parser this downloader is done
             return 1; # See above return
         }

         # $inqueue should be the outqueue from the downloader sub.
         # $outqueue should be passed to the output sub.
         # $regex is, of course, your regex. This allows for re-use of the code.
         # You could also consider taking some parsing rules and using an HTML
         # parser of some type...
         sub ParseContent {
             my ($outqueue, $inqueue, $regex) = @_;
             while (defined(my $content = $inqueue->dequeue)) {
                 $outqueue->enqueue(join '', $content =~ m/$regex/m, "\n");
             }
             $outqueue->enqueue(undef); # tell the writer this parser is done
             return 1;
         }

         # $queue should be the outqueue passed to ParseContent.
         # $nr_producers is the number of parser threads feeding $queue.
         sub WriteOut {
             my ($queue, $filename, $nr_producers) = @_;
             open my $outFH, '>>', $filename or die "WriteOut: open failed: $!";
             # Every parser sends one undef when it finishes; stop after the last.
             while ($nr_producers > 0) {
                 my $data = $queue->dequeue;
                 if (defined $data) { print $outFH $data }
                 else               { $nr_producers-- }
             }
             close $outFH;
             return 1;
         }

  2. If that seemed confusing, just wait. You'll understand when the code ties it together. You just start all of your various threads with their queues and watch the magic happen.

         my $nr_workers = 5; # Number of side-by-side downloaders and parsers.
                             # Better yet, take it as an argument.
         my $urlfile = "url_planets.txt";   # see comment about arguments
         my $outfile = "planet_names.txt";  # arguments are nice here too, but
                                            # not the current point

         my $URLQueue     = Thread::Queue->new;
         my $ContentQueue = Thread::Queue->new;
         my $ParsedQueue  = Thread::Queue->new;
         my @threadObjs;

         # Create the reading thread and store a reference to it in the
         # threadObjs array; this will be important later.
         push @threadObjs, threads->create(\&ReadURLS, $URLQueue, $urlfile, $nr_workers);

         # Set up the workers; any number of them can manipulate the queues.
         for (1 .. $nr_workers) {
             push @threadObjs, threads->create(\&DownloadContent, $ContentQueue, $URLQueue);
             push @threadObjs, threads->create(\&ParseContent, $ParsedQueue, $ContentQueue,
                                               qr!Rotations<i>(.*)</i>!);
         }
         push @threadObjs, threads->create(\&WriteOut, $ParsedQueue, $outfile, $nr_workers);

         # Now that all the threads are created, the main thread should call
         # join on all of its child thread objects to ask perl to clean up
         # after them, and so it doesn't exit before they're done, causing an
         # abrupt termination.
         foreach my $thr (@threadObjs) {
             $thr->join(); # Join can have a return value, but checking it adds
                           # overhead; do it only if you really need to.
         }
         # At this point, barring some horrible catastrophe, the specified
         # $outfile should have the desired output.

It should be noted that this is much more than is needed for just a speed boost; this post is intended to provide some example-based direction for learning threaded programming. If you made it to the end, I suggest you go read perlthrtut and explore the references and See Alsos it mentions.

If you're looking for a simpler answer, BrowserUK's response will do just fine.


In reply to Re: How to download html with threads? by Trizor
in thread How to download html with threads? by Zeokat
