in reply to Re: Multithread Web Crawler
in thread Multithread Web Crawler

Agreed here.

A web crawling application is not going to see much benefit from the light-weightedness of threads, since crawling is by its nature a fairly heavy task.

If you decide that threads don't really hold an advantage for your application, you can save yourself a whole load of work by forking off processes instead.

As pointed out in a recent node, Parallel::ForkManager might be of use to you. The module description includes:

This module is intended for use in operations that can be done in parallel where the number of processes to be forked off should be limited. Typical use is a downloader which will be retrieving hundreds/thousands of files.
Sounds right up your tree? Or is that down your tree? (I never did work out where the roots for a red-black tree would go).
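
For illustration, here is a minimal sketch of that kind of limited-fork downloader. The command-line URL list, the filename munging and the cap of ten children are all made up for the example; only the start/finish/wait_all_children calls are the module's actual API:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::Simple qw(getstore);

    # Made-up setup: URLs come from the command line, at most 10 children.
    my @urls = @ARGV;
    my $pm   = Parallel::ForkManager->new(10);

    for my $url (@urls) {
        $pm->start and next;                 # parent moves on to the next URL
        (my $file = $url) =~ s{[^\w.]+}{_}g; # crude local filename
        getstore($url, $file);               # child does the actual fetch
        $pm->finish;                         # child exits, slot is freed
    }
    $pm->wait_all_children;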

Re^3: Multithread Web Crawler
by xuqy (Initiate) on Sep 23, 2005 at 13:40 UTC
    Thank you so much. I did try Parallel::ForkManager, but came up against a puzzle: how do I share data between processes? To avoid crawling the same page repeatedly, a global tied hash has to be shared by all the crawling processes. I experimented and found that the forked processes all just ended up with the same crawling history. Can you do me a favour and suggest a patch?
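
For what it's worth, each forked child gets a copy of the parent's memory, so updates made to an in-memory (even tied) hash in one child never reach the parent or the siblings. A rough sketch of one workaround, assuming a newer Parallel::ForkManager (0.7.6 or later, which can pass a data structure back from each child through finish()): keep the %seen hash and the URL queue in the parent only, and have every child hand its discovered links back via run_on_finish(). The seed URL, the child limit and the crude href regex below are all made-up details for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::Simple qw(get);

    my @queue = ('http://www.example.com/'); # hypothetical seed URL
    my %seen;                                # crawl history, parent only
    my $pm = Parallel::ForkManager->new(5);  # at most 5 children

    # The parent collects each child's links here; it is the only
    # process that ever writes to %seen or @queue.
    $pm->run_on_finish(sub {
        my ($pid, $exit, $ident, $signal, $core, $links) = @_;
        push @queue, grep { !$seen{$_} } @{ $links || [] };
    });

    while (@queue) {
        while (my $url = shift @queue) {
            next if $seen{$url}++;           # parent marks pages as dispatched
            $pm->start and next;             # parent keeps scheduling
            my $html  = get($url) || '';
            my @links = $html =~ m{href\s*=\s*"(http://[^"]+)"}gi; # crude
            $pm->finish(0, \@links);         # ship links back to the parent
        }
        $pm->wait_all_children;              # run_on_finish may refill @queue
    }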