in reply to How to download html with threads?
Threaded programming is not easy, and while parallel LWP may be what you want, this is a good example for introducing the concept if you would like to learn threading in general.
If you look at the steps your program goes through, they list like this:

1. Read URLs from a file
2. Download the content at each URL
3. Parse the content for the data you want
4. Write the parsed data out to a file

This maps naturally onto a combination of the pipeline and work crew threading models.
Pipeline? Work crew? What are those? you ask. The work crew model of threading creates multiple threads that all do the same thing on different pieces of data, allowing you to leverage multiprocessing on your system to get through the work faster. Be warned, though: the overhead of creating a ton of threads can outweigh this benefit. As a rule of thumb, the most work crew threads you want per job is the number of cores you have, plus one.
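The work crew idea can be sketched in a few lines. This is a toy example, not the OP's downloader: the "job" here is just squaring numbers, and the crew size and queue names are made up for illustration.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs      = Thread::Queue->new;
my $results   = Thread::Queue->new;
my $crew_size = 4;    # roughly cores + 1 on a quad-core box

# Every crew member runs the *same* loop, each on different items.
my @crew = map {
    threads->create(sub {
        while (defined(my $n = $jobs->dequeue)) {
            $results->enqueue($n * $n);
        }
    });
} 1 .. $crew_size;

$jobs->enqueue($_) for 1 .. 10;
$jobs->enqueue(undef) for 1 .. $crew_size;   # one terminator per worker
$_->join for @crew;

my $sum = 0;
$sum += $results->dequeue while $results->pending;
print "sum of squares: $sum\n";   # 385
```

Note the one-undef-per-worker trick at the end: each worker consumes exactly one terminator, so every thread gets the message to shut down.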
The pipeline threading model creates separate threads for separate tasks that are normally run sequentially but need to be run over large amounts of data, so that each task can feed the next. This again can leverage the multiple cores on a system: if 4 threads are running a 4-part task, you are essentially running 4 of the tasks in parallel. And if one part can run faster (say, the ingest part), it doesn't have to wait for the others; it can finish, freeing the system to do other things while its data waits in the queue.
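Here is a minimal two-stage pipeline sketch. The stages and data are invented for illustration (stage one "ingests" by doubling numbers, stage two formats them); the point is that both stages run concurrently, connected by a queue.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $raw    = Thread::Queue->new;   # feeds stage one
my $cooked = Thread::Queue->new;   # stage one feeds stage two
my $done   = Thread::Queue->new;   # stage two's output

# Stage one: "ingest" - double each number as it arrives.
my $stage1 = threads->create(sub {
    while (defined(my $n = $raw->dequeue)) {
        $cooked->enqueue($n * 2);
    }
    $cooked->enqueue(undef);       # tell the next stage we're done
});

# Stage two: "format" - runs at the same time as stage one.
my $stage2 = threads->create(sub {
    while (defined(my $n = $cooked->dequeue)) {
        $done->enqueue("value: $n");
    }
});

$raw->enqueue($_) for 1 .. 3;
$raw->enqueue(undef);              # end-of-data marker
$_->join for $stage1, $stage2;

my @out;
push @out, $done->dequeue_nb while $done->pending;
print "$_\n" for @out;             # value: 2, value: 4, value: 6
```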
These processing models sound great, but how does data move through the pipeline? There are many hard, complex, wizardly answers to this question, but perl makes things easy and provides Thread::Queue for dealing with it.
Thread::Queue provides a thread-safe queue for passing data between threads. Its two main methods are enqueue and dequeue, just like the non-thread-safe queue from any basic CS class.
To extensively thread the code you provided, three Thread::Queue objects are required (#XXX: Has anyone ever thought of a Thread::Stack object...?): one for sending URLs from the file reader to the downloaders, one for the downloaders to send their content to the parsers, and one for the parsers to send their parsed data to the file writer.
The threads module facilitates the creation and management of threads. Creating threads is very easy to do with perl: simply pass threads->create a code reference and some arguments for the subroutine, and it will be up and running in its own thread.
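For example (the sub and its arguments here are invented; note the backslash in \&greet, which passes a reference to the sub, where a bare &greet would call it immediately instead):

```perl
use strict;
use warnings;
use threads;

sub greet {
    my ($name, $times) = @_;
    return join ' ', ("hello $name") x $times;
}

# Pass a code ref plus arguments; get back a thread object.
my $thr = threads->create(\&greet, 'world', 2);

# join() waits for the thread and hands back its return value.
my $greeting = $thr->join;
print "$greeting\n";   # hello world hello world
```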
So, you have threads, you have data structures, you have a model. What to do what to do what to do? Well stitch it all together!
```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;
use HTTP::Request;

sub ReadURLS {
    my ($queue, $filename) = @_;
    # Use three-arg open for security reasons, die on errors so we
    # don't spew nonsense or crash worse later.
    open my $urlFile, '<', $filename or die "ReadURLS: bad file!: $!";
    chomp(my @urls = <$urlFile>);   # strip newlines or the requests will be malformed
    close $urlFile;
    # Place each line into the queue, followed by undef to signal the
    # end of data.
    $queue->enqueue(@urls, undef);
    return 1;   # Success! Return true. Or, if you're a unixy person,
                # return 0, or maybe even "0 but true".
}

# $inqueue should be the queue object passed to ReadURLS
# $outqueue should be the queue object passed to ParseContent
sub DownloadContent {
    my ($outqueue, $inqueue) = @_;
    # Each thread needs its own UA
    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)");
    $ua->timeout(15);
    # Wait for data, and stop when undef comes down the pipe (that
    # means there's no more).
    while (defined(my $url = $inqueue->dequeue)) {
        # this part should look familiar
        print "Downloading: $url\n";
        my $req      = HTTP::Request->new(GET => $url);
        my $response = $ua->request($req);
        # this changes: send the output to the next task handler
        $outqueue->enqueue($response->content());
    }
    $inqueue->enqueue(undef);    # pass the terminator on so sibling workers also stop
    $outqueue->enqueue(undef);   # and signal one parser downstream
    return 1;                    # See above return
}

# $inqueue should be the outqueue from the downloader sub
# $outqueue should be passed to the output sub
# $regex is, of course, your regex. This allows for re-use of the code.
# You could also consider taking some parsing rules and using an HTML
# parser of some type...
sub ParseContent {
    my ($outqueue, $inqueue, $regex) = @_;
    while (defined(my $content = $inqueue->dequeue)) {
        $outqueue->enqueue(join '', $content =~ m/$regex/m, "\n");
    }
    $outqueue->enqueue(undef);   # one terminator per parser reaches the writer
    return 1;
}

# $queue should be the outqueue passed to ParseContent
# $n_producers is how many undef terminators to expect (one per parser)
sub WriteOut {
    my ($queue, $filename, $n_producers) = @_;
    $n_producers = 1 unless defined $n_producers;
    open my $outFH, '>>', $filename or die "WriteOut: open failed: $!";
    while ($n_producers) {
        my $data = $queue->dequeue;
        if (!defined $data) { $n_producers--; next; }
        print $outFH $data;
    }
    close $outFH;
    return 1;
}
```

Note that with several downloaders sharing one input queue, a single undef terminator is not enough: only one worker would see it and the rest would block forever. Each downloader therefore puts the terminator back for its siblings before signalling downstream, and the writer counts one terminator per parser before it closes the file.
```perl
my $nr_workers = 5;   # the number of side-by-side downloaders and
                      # parsers. Better yet, take it as an argument.
my $urlfile = "url_planets.txt";    # see comment about arguments
my $outfile = "planet_names.txt";   # arguments are nice here too, but
                                    # not the current point

my $URLQueue     = Thread::Queue->new;
my $ContentQueue = Thread::Queue->new;
my $ParsedQueue  = Thread::Queue->new;

my @threadObjs;
# Create the reading thread and store a reference to it in @threadObjs;
# this will be important later. Note \&ReadURLS: a code reference, not
# a call.
push @threadObjs, threads->create(\&ReadURLS, $URLQueue, $urlfile);

# Set up the workers; any number of them can manipulate the queues.
for (1 .. $nr_workers) {
    push @threadObjs, threads->create(\&DownloadContent,
                                      $ContentQueue, $URLQueue);
    push @threadObjs, threads->create(\&ParseContent,
                                      $ParsedQueue, $ContentQueue,
                                      qr!Rotations<i>(.*)</i>!);
}
push @threadObjs, threads->create(\&WriteOut,
                                  $ParsedQueue, $outfile, $nr_workers);

# Now that all the threads are created, the main thread should call
# join on all of its child thread objects, to ask perl to clean up
# after them and so it doesn't exit before they're done, causing an
# abrupt termination.
foreach my $thr (@threadObjs) {
    $thr->join();   # join can have a return value, but checking it
                    # adds overhead; do so only if you really need to
}

# At this point, barring some horrible catastrophe, the specified
# $outfile should have the desired output.
```
If you're looking for a simpler answer, BrowserUk's response will do just fine.
Re^2: How to download html with threads?
by BrowserUk (Patriarch) on Jul 31, 2007 at 01:02 UTC
by Trizor (Pilgrim) on Jul 31, 2007 at 06:26 UTC