in reply to Re (tilly) 5: Parallel Downloads using Parallel::ForkManager or whatever works!!!
in thread Parallel Downloads using Parallel::ForkManager or whatever works!!!
The indentation style got lost in my unfamiliarity with the posting methods. Good advice that I don't always follow, but will endeavor to follow from now on. The strict.pm thing I will study; I'm not too familiar with it. Man, I've got a lot to learn.
Anyway, I have read the documentation for Parallel::ForkManager; it's pretty straightforward. I had a version similar to what you suggested working with LWP::Simple's getstore, just to see if it worked. It did. However, what I'm trying to do is incorporate ForkManager into a working metacrawler, and LWP::Simple's getstore is O.K. but not preferred.
The metacrawler (I call it "MEGA-Metacrawler") retrieves pages from various search engines based on keyword lists, stores the web pages in a hash, processes them for secondary patterns, and then spits out the processed results. I am using parts of LWP::Simple, LWP::UserAgent, HTTP::Status, HTTP::Request, and Digest::Perl::MD5. It's effective but dog sloooooowwww against the hundreds or thousands of pages that must be reviewed and processed.
I have had our resident Perl Guru look at the code for efficiency, duplication, etc. It seems to be O.K. in that department. The problem seems to be two-fold: the speed (or lack thereof) of downloading one web page at a time, and the processing of one downloaded page at a time in the pattern-matching routine. Thus the need for parallel processes.
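For what it's worth, here is a minimal sketch of how the fork loop might look. The @urls list, the %pages hash, and the limit of 10 children are all assumptions for illustration; also note that the data-passing through finish(), used here to get each page back into the parent's hash, requires a reasonably recent Parallel::ForkManager (0.7.6 or later):

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use Parallel::ForkManager;

# Hypothetical list of URLs gathered from the search engines.
my @urls = ('http://example.com/one.html', 'http://example.com/two.html');

my %pages;                                  # url => page content
my $pm = Parallel::ForkManager->new(10);    # at most 10 children at once

# Each child is a separate process, so it cannot write into the
# parent's %pages directly; instead it hands its result back here.
$pm->run_on_finish(sub {
    my ($pid, $exit, $url, $signal, $core, $data_ref) = @_;
    $pages{$url} = $$data_ref if defined $data_ref;
});

for my $url (@urls) {
    $pm->start($url) and next;              # parent: move on to next URL
    my $content = get($url);                # child: fetch the page
    $pm->finish(0, \$content);              # child: return content to parent
}
$pm->wait_all_children;

# %pages now holds the downloaded documents, ready for pattern matching.
```

The key point is that forked children get a copy of the parent's data, so anything a child stores in %pages vanishes when it exits; the run_on_finish callback is how results cross back over.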
BTW, in the code I posted, I initially left out the ");" at the end of the following line; it should read:
$res = $ua->request($req, "$name.html");
(Yeah I just realized, I need to use line numbers next time!)
Nevertheless, if I comment out all of the "$pm->" lines of code, it will download the pages in the hash. This also works in the MEGA-Metacrawler. But again, it's fast for 3 or 4 URLs and slow for 4000-5000. Finally, I believe I need to use get instead of getstore. So the code I have provided is a nearly exact excerpt from the metacrawler (which is nearly 1000 lines and thus not provided). I intend to plug the parallel routine into the MEGA-Metacrawler once the final version works properly.
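On that last point, the in-memory alternative to getstore might look like this with LWP::UserAgent (the URL is hypothetical): calling request() without a filename argument leaves the body in the response object instead of writing it to disk, so it can go straight into the hash for pattern matching.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new(timeout => 30);
my $req = HTTP::Request->new(GET => 'http://example.com/');  # hypothetical URL

# No second argument to request(), so nothing is written to a file.
my $res = $ua->request($req);

if ($res->is_success) {
    my $html = $res->content;   # the page, in memory, ready to process
}
else {
    warn "Fetch failed: ", $res->status_line, "\n";
}
```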
I will spend the next day(s) continuing to study merlyn's article referenced above and re-reading the Parallel::ForkManager documentation to gain a better understanding of it.
Thanks again
Your Humble Perl Initiate
Re (tilly) 7: Parallel Downloads using Parallel::ForkManager or whatever works!!!
by tilly (Archbishop) on Jan 09, 2002 at 14:54 UTC
by jamesluc (Novice) on Jan 09, 2002 at 19:53 UTC