
Thanks for the guidance. I understand the ongoing dialogue concept. Cool!

The indentation style got lost because I'm unfamiliar with the posting methods. Indenting consistently is good advice that I don't always follow, but I will endeavor to follow it from now on. I will study the strict.pm thing; I'm not too familiar with it. Man, I've got a lot to learn.

Anyway, I have read the documentation for Parallel::ForkManager, and it's pretty straightforward. I had a version similar to what you suggested working with the LWP::Simple getstore, just to see if it worked. It did. However, what I'm trying to do is incorporate the ForkManager into a working metacrawler. The LWP::Simple getstore is O.K., but not preferred.

The metacrawler (I call it “MEGA-Metacrawler”) retrieves pages from various search engines based on keyword lists, stores the web pages in a hash, processes them for secondary patterns, and then spits out the processed results. I am using parts of LWP::Simple, LWP::UserAgent, HTTP::Status, HTTP::Request, and Digest::Perl::MD5. It's effective, but dog sloooooowwwww against the hundreds or thousands of pages that must be reviewed and processed.

I have had our resident Perl Guru look at the code for efficiency, duplication, etc. It seems to be O.K. in that department. The problem seems to be two-fold: the slow speed of downloading one web page at a time, and the processing of one downloaded page at a time in the pattern-matching routine. Thus the need for parallel processes.

BTW, in the code I posted, I initially left out the “);” in the following line:

$res = $ua->request($req, "$name.html"

(Yeah I just realized, I need to use line numbers next time!)

Nevertheless, if I comment out all of the “$pm->” lines of code, it will download the pages in the hash. This also works in the MEGA-Metacrawler. But again, it's fast for 3 or 4 URLs and slow for 4000 – 5000. Finally, I believe that I need to use GET instead of getstore, so that the page content ends up in memory rather than in a file. So the code I have provided is a nearly exact excerpt from the metacrawler (which is nearly 1000 lines and thus not provided). I intend to plug the parallel routine into the MEGA-Metacrawler once the final version works properly.
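
To make the goal concrete, here is a minimal sketch of parallel GETs that hand the page content back to the parent's hash. It is not the metacrawler code itself. The fetch_all name, the optional $fetch hook, the 30-second timeout, and the limit of 10 children are all assumptions for illustration. It also assumes a Parallel::ForkManager recent enough that finish() accepts a data reference and run_on_finish receives it; as far as I can tell from the documentation, older releases can only pass back an exit code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Fetch every URL in %$urls with at most $max_procs children at a time.
# Returns a hash ref: name => page content (undef if the fetch failed).
# $fetch is an optional code ref (a hypothetical hook, not part of any
# library) so the download step can be swapped out; by default it does
# a plain GET with LWP::UserAgent.
sub fetch_all {
    my ($urls, $max_procs, $fetch) = @_;

    $fetch ||= do {
        my $ua = LWP::UserAgent->new(timeout => 30);   # assumed timeout
        sub {
            my ($url) = @_;
            my $res = $ua->get($url);
            return $res->is_success ? $res->decoded_content : undef;
        };
    };

    my %pages;
    my $pm = Parallel::ForkManager->new($max_procs);

    # Runs in the parent each time a child exits. $data is the reference
    # the child passed to finish(); $name is the identifier from start().
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $name, $signal, $core_dump, $data) = @_;
        $pages{$name} = $data ? $$data : undef;
    });

    for my $name (sort keys %$urls) {
        $pm->start($name) and next;               # parent: go to next URL
        my $content = $fetch->($urls->{$name});   # child: download one page
        $pm->finish(defined $content ? 0 : 1, \$content);
    }
    $pm->wait_all_children;

    return \%pages;
}

# Usage: perl fetch.pl name=url [name=url ...]
my %urls;
for my $arg (@ARGV) {
    my ($name, $url) = split /=/, $arg, 2;
    $urls{$name} = $url;
}

my $pages = fetch_all(\%urls, 10);                # assumed limit of 10
for my $name (sort keys %$pages) {
    my $page = $pages->{$name};
    printf "%s: %s\n", $name,
        defined $page ? length($page) . " characters" : "failed";
}
```

Each child is a separate process, so anything it stores in its own copy of a hash is lost when it exits; that is why the content travels back through finish(). The pattern matching could also run inside the child, so that only the processed results are returned.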

I will spend the next day(s) continuing to study merlyn's article referenced above and re-reading the Parallel::ForkManager documentation to gain a better understanding of it.

Thanks again

Your Humble Perl Initiate

In reply to Re: Re (tilly) 5: Parallel Downloads using Parallel::ForkManager or whatever works!!! by jamesluc
in thread Parallel Downloads using Parallel::ForkManager or whatever works!!! by jamesluc
