in reply to Multithread Web Crawler

You might want real processes there, instead of "threads". See any of my "link checker" articles that involve fork.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re^2: Multithread Web Crawler
by aufflick (Deacon) on Sep 22, 2005 at 07:40 UTC
    Agreed here.

    A web crawling application is not going to see much benefit from the lightness of multiple threads, since it is by its nature fairly heavy.

    If you decide that threads don't really hold an advantage for your application, you can save yourself a whole load of work by forking off processes.

    As pointed out in a recent node, Parallel::ForkManager might be of use to you (there's a rough usage sketch below). The module description includes:

    This module is intended for use in operations that can be done in parallel where the number of processes to be forked off should be limited. Typical use is a downloader which will be retrieving hundreds/thousands of files.
    Sounds right up your tree? Or is that down your tree? (I never did work out where the roots for a red-black tree would go).
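
    Purely as a rough sketch of that "limited number of forked downloaders" pattern (the URL list and the per-page handling below are placeholders, not your code):

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use Parallel::ForkManager;

        my @urls = qw(http://example.com/a http://example.com/b);   # placeholder URL list
        my $pm   = Parallel::ForkManager->new(5);                   # at most 5 children at once

        for my $url (@urls) {
            $pm->start and next;     # parent: child launched, move on to the next URL
            my $page = get($url);    # child: do the slow network fetch
            # ... parse/store $page here ...
            $pm->finish;             # child: exit
        }
        $pm->wait_all_children;      # parent: wait for the stragglers

    Note that each child is a separate process, so anything it builds up in memory disappears when it exits; the parent only sees what the child writes somewhere shared or hands back to it.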
      Thank you so much. I did try Parallel::ForkManager but came up against a puzzle: how do you share data between processes? To avoid crawling the same page repeatedly, a global tied hash has to be shared by all the crawling processes. I experimented and found that all the forked processes just ended up with the same crawling history. Could you do me a favor and suggest a fix?
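
      (For reference, a common way around that is to keep the %seen hash only in the parent: the parent hands out URLs and the children just fetch, passing whatever links they discover back to the parent when they exit. Newer versions of Parallel::ForkManager can hand a data structure back through finish() and a run_on_finish callback; the sketch below assumes such a version and leaves the actual link extraction as a placeholder.)

          use strict;
          use warnings;
          use LWP::Simple qw(get);
          use Parallel::ForkManager;

          my %seen  = ();                        # crawl history lives only in the parent
          my @queue = ('http://example.com/');   # placeholder start page
          my $pm    = Parallel::ForkManager->new(5);

          # When a child is reaped, the parent receives the links that child found.
          $pm->run_on_finish(sub {
              my ($pid, $exit, $ident, $signal, $core, $links) = @_;
              push @queue, @{ $links || [] };
          });

          while (@queue) {
              while (my $url = shift @queue) {
                  next if $seen{$url}++;         # only the parent consults the history
                  $pm->start and next;           # parent: move on; child falls through
                  my $page  = get($url);
                  my @found = ();                # ... extract links from $page here ...
                  $pm->finish(0, \@found);       # child: hand the new links back
              }
              $pm->wait_all_children;            # reaping children may refill @queue
          }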