spx2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm planning an application which will extract pages from a site and process them in parallel.
Do you think it would be okay to open, say, 20 sessions in POE and have each of them use a WWW::Mechanize object to fetch the web pages, with one controlling thread that hands them work to do, perhaps via a queue from which each thread takes a link to process?
The processed information would then go into a database, which would be locked whenever it is written to.
I saw on CPAN that there is also the module POE::Component::Client::HTTP, and I'm not sure whether to use it. It does something very interesting: it "lets other sessions run while HTTP transactions are being processed". However, I would also like to have the functionality I get from WWW::Mechanize in POE::Component::Client::HTTP.
What should I do?

EDIT:
Actually my question is about POE::Component::Client::HTTP. I like it, but it doesn't have all the useful features of WWW::Mechanize.

Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by Ultra (Hermit) on Dec 24, 2007 at 11:58 UTC

    Note that POE isn't about "threads": your program is basically a single process, and the work is split into atomic tasks called events, which should be non-blocking.

    For multiple events to coexist, your POE program should use POE Wheels or Components that are non-blocking (usually select-based).

    To run blocking pieces of code that aren't POE-aware, such as WWW::Mechanize, you may spawn a POE::Wheel::Run wheel, which is in fact another process, so that the rest of your application doesn't block.

    However, this alone wouldn't justify the use of POE, because you'd either spawn a new process for each request or use a pool of processes with one WWW::Mechanize client per process, each process fetching and parsing one page at a time.
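    A rough, untested sketch of that approach (the URLs and event names here are only placeholders I'm inventing for the example):

        use strict;
        use warnings;
        use POE qw(Wheel::Run);
        use WWW::Mechanize;

        my @urls = ('http://example.com/', 'http://example.org/');  # placeholders

        POE::Session->create(
            inline_states => {
                _start => sub {
                    my ($kernel, $heap) = @_[KERNEL, HEAP];
                    for my $url (@urls) {
                        # Each wheel forks a child process; the blocking fetch
                        # happens there, so the POE kernel keeps running.
                        my $wheel = POE::Wheel::Run->new(
                            Program => sub {
                                my $mech = WWW::Mechanize->new;
                                $mech->get($url);          # blocks only this child
                                print $mech->title, "\n";  # report back on stdout
                            },
                            StdoutEvent => 'child_stdout',
                            CloseEvent  => 'child_closed',
                        );
                        $heap->{wheels}{ $wheel->ID } = $wheel;   # keep the wheel alive
                        $kernel->sig_child($wheel->PID, 'child_reaped');
                    }
                },
                child_stdout => sub { print "child says: $_[ARG0]\n" },
                child_closed => sub { delete $_[HEAP]{wheels}{ $_[ARG0] } },
                child_reaped => sub { },  # reap children so the kernel can exit
            },
        );

        POE::Kernel->run;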

    POE::Component::Client::HTTP isn't as powerful as WWW::Mechanize, but you might use it to fetch the page contents and then have another session (in a separate process or not) do the processing of each page's content.

    POE Cookbook - Web Client shows a simple usage of POE::Component::Client::HTTP
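    For reference, the cookbook recipe boils down to something like this (again untested as written here; the alias, event name, and URL are placeholders):

        use strict;
        use warnings;
        use POE qw(Component::Client::HTTP);
        use HTTP::Request::Common qw(GET);

        # One shared, non-blocking user agent for the whole program.
        POE::Component::Client::HTTP->spawn(Alias => 'ua', Timeout => 60);

        POE::Session->create(
            inline_states => {
                _start => sub {
                    # Post a request; 'got_response' fires when it completes,
                    # and other sessions keep running in the meantime.
                    $_[KERNEL]->post('ua', 'request', 'got_response',
                                     GET 'http://example.com/');
                },
                got_response => sub {
                    my ($request_packet, $response_packet) = @_[ARG0, ARG1];
                    my $response = $response_packet->[0];   # an HTTP::Response
                    print $response->code, ': ',
                          length($response->content), " bytes\n";
                },
            },
        );

        POE::Kernel->run;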

    Dodge This!
Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by lestrrat (Deacon) on Dec 24, 2007 at 11:25 UTC

    Using WWW::Mechanize from within POE doesn't give you any advantage from using POE, because WWW::Mechanize internally uses LWP::UserAgent, which blocks.

    Note that POE isn't a multi-process framework. It's an event-based asynchronous framework. If you want to take advantage of POE, you need to use libraries that know about its asynchronous nature.

      Thanks for the reply. Yes, I was afraid of that, but POE::Component::Client::HTTP is very basic in functionality; that was the problem...
Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by pc88mxer (Vicar) on Dec 24, 2007 at 22:41 UTC
    Here's my 2 cents -- ++ this node if you like the idea.

    In my experience, making a data-munging operation "multi-threaded" is not the problem. The real problems with these programs are operational in nature: making them robust in the presence of errors, making them restartable, being able to easily add or modify functionality, and being able to test parts of the process in isolation.

    The paradigm I've used over and over again is something I'll just call the "work pool" approach. Basically you have a "database" that contains a list of all the tasks that need to be performed. Then you have worker processes which acquire tasks, perform them, and mark them as completed. As part of performing a task, a worker process can add additional tasks to the database (or "work pool").

    This, of course, is not a new idea. Programs like sendmail operate in this fashion using the file system for the work pool. You can also use a real database (like mysql) or a persistent hash implementation like GDBM or even a commercial offering like mqseries.

    The advantages of structuring your application like this are numerous. For starters, you can make your application "multi-threaded" by using ordinary processes. Starting and stopping your application is now possible, since the state of your application is persistently stored. Also, if you, say, make each worker process task-specific, it is very easy to control which tasks get performed. Tasks which can't be performed due to resource unavailability (e.g. remote web site not available, not enough local disk space, etc.) can just be put back in the work pool to be executed later.

    In the past I've just rolled my own work pool implementation from scratch. Sometimes I used the file system, other times I used a database. This is such a useful and common pattern that it would be very helpful to have a framework (specifically a perl framework) for implementing work pools.
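    As a concrete illustration, a bare-bones version of the idea might look like the sketch below. This is only a sketch: the SQLite backend, the table layout, and the status values are choices I'm making up for the example, and the claim logic is only as race-safe as the underlying database makes it.

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=workpool.db', '', '',
                               { RaiseError => 1, AutoCommit => 1 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS tasks (
                id     INTEGER PRIMARY KEY AUTOINCREMENT,
                url    TEXT NOT NULL,
                status TEXT NOT NULL DEFAULT 'pending'  -- pending/running/done/failed
            )
        });

        # Grab one pending task and mark it as running; if another worker
        # claims it first, just move on to the next pending task.
        sub claim_task {
            while (my $row = $dbh->selectrow_hashref(
                       q{SELECT id, url FROM tasks WHERE status = 'pending' LIMIT 1})) {
                my $n = $dbh->do(
                    q{UPDATE tasks SET status = 'running'
                      WHERE id = ? AND status = 'pending'}, undef, $row->{id});
                return $row if $n > 0;
            }
            return;
        }

        sub finish_task {
            my ($id, $ok) = @_;
            $dbh->do(q{UPDATE tasks SET status = ? WHERE id = ?},
                     undef, $ok ? 'done' : 'failed', $id);
        }

        sub process_url {
            my ($url) = @_;
            print "processing $url\n";   # the real fetching/parsing goes here
        }

        # Each worker is just an ordinary process running this loop; failed
        # tasks stay marked 'failed' and can be reset to 'pending' later.
        while (my $task = claim_task()) {
            my $ok = eval { process_url($task->{url}); 1 };
            finish_task($task->{id}, $ok);
        }

    Run several copies of this worker in parallel and they will share the pool; tasks added by one worker simply become new 'pending' rows for any of them to pick up.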

    Hope this helps.

        Swarmage is being re-written as we speak, so don't use it. However, if you just want a crawler framework, check out Gungho (but it's not really a WWW::Mechanize type of framework).