in reply to Re (tilly) 5: Parallel Downloads using Parallel::ForkManager or whatever works!!!
in thread Parallel Downloads using Parallel::ForkManager or whatever works!!!
The indentation style got lost in my unfamiliarity with the posting methods. Good advice that I don't always follow, but will endeavor to follow from now on. The strict.pm thing I will study; I'm not too familiar with it. Man, I've got a lot to learn.
Anyway, I have read the documentation for Parallel::ForkManager; it's pretty straightforward. I had a version similar to what you suggested working with LWP::Simple's getstore, just to see if it worked. It did. However, what I'm trying to do is incorporate ForkManager into a working metacrawler, and LWP::Simple's getstore is O.K. but not preferred.
The metacrawler (I call it "MEGA-Metacrawler") retrieves pages from various search engines based on keyword lists, stores the web pages in a hash, processes them for secondary patterns, and then spits out the processed results. I am using parts of LWP::Simple, LWP::UserAgent, HTTP::Status, HTTP::Request, and Digest::Perl::MD5. It's effective but dog sloooooowwww against the hundreds or thousands of pages that must be reviewed and processed.
I have had our resident Perl Guru look at the code for efficiency, duplication, etc. It seems to be O.K. in that department. The problem seems to be two-fold: the speed (or lack thereof) of downloading one web page at a time, and the processing of one downloaded page at a time in the pattern-matching routine. Thus the need for parallel processes.
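For what it's worth, here is a minimal sketch of how the fork loop might look. The @urls list, the %pages hash, and the limit of 10 children are all assumptions for illustration; also note that the data-passing through finish(), used here to get each page back into the parent's hash, requires a reasonably recent Parallel::ForkManager (0.7.6 or later):

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use Parallel::ForkManager;

# Hypothetical list of URLs gathered from the search engines.
my @urls = ('http://example.com/one.html', 'http://example.com/two.html');

my %pages;                                  # url => page content
my $pm = Parallel::ForkManager->new(10);    # at most 10 children at once

# Each child is a separate process, so it cannot write into the
# parent's %pages directly; instead it hands its result back here.
$pm->run_on_finish(sub {
    my ($pid, $exit, $url, $signal, $core, $data_ref) = @_;
    $pages{$url} = $$data_ref if defined $data_ref;
});

for my $url (@urls) {
    $pm->start($url) and next;              # parent: move on to next URL
    my $content = get($url);                # child: fetch the page
    $pm->finish(0, \$content);              # child: return content to parent
}
$pm->wait_all_children;

# %pages now holds the downloaded documents, ready for pattern matching.
```

The key point is that forked children get a copy of the parent's data, so anything a child stores in %pages vanishes when it exits; the run_on_finish callback is how results cross back over.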
BTW, in the code I posted, I initially left out the ");" at the end of the following line; it should read:
$res = $ua->request($req, "$name.html");
(Yeah I just realized, I need to use line numbers next time!)
Nevertheless, if I comment out all of the "$pm->" lines of code, it will download the pages in the hash. This also works in the MEGA-Metacrawler. But again, it's fast for 3 or 4 URLs and slow for 4000-5000. Finally, I believe I need to use get instead of getstore. So the code I have provided is a nearly exact excerpt from the metacrawler (which is nearly 1000 lines and thus not provided). I intend to plug the parallel routine into the MEGA-Metacrawler once the final version works properly.
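On that last point, the in-memory alternative to getstore might look like this with LWP::UserAgent (the URL is hypothetical): calling request() without a filename argument leaves the body in the response object instead of writing it to disk, so it can go straight into the hash for pattern matching.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new(timeout => 30);
my $req = HTTP::Request->new(GET => 'http://example.com/');  # hypothetical URL

# No second argument to request(), so nothing is written to a file.
my $res = $ua->request($req);

if ($res->is_success) {
    my $html = $res->content;   # the page, in memory, ready to process
}
else {
    warn "Fetch failed: ", $res->status_line, "\n";
}
```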
I will spend the next day(s) continuing to study merlyn's article referenced above and re-reading the Parallel::ForkManager documentation to gain a better understanding of it.
Thanks again
Your Humble Perl Initiate
Re (tilly) 7: Parallel Downloads using Parallel::ForkManager or whatever works!!!
by tilly (Archbishop) on Jan 09, 2002 at 14:54 UTC
by jamesluc (Novice) on Jan 09, 2002 at 19:53 UTC