spx2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm planning an application which will extract pages from a site and process them in parallel.
Do you think it would be okay to open, say, 20 sessions in POE and have each of them use a WWW::Mechanize object to fetch the web pages, with one controlling thread that hands them work to do, perhaps via a queue from which each thread takes a link to process?
The processed information would then go into a database, which would be locked whenever it is written to.
I saw on CPAN that there is also the module POE::Component::Client::HTTP, and I'm not sure whether to use it. It does something very interesting: it "lets other sessions run while HTTP transactions are being processed". However, I would also like to have the functionality I get from WWW::Mechanize in POE::Component::Client::HTTP.
What should I do?

EDIT:
Actually my question is about POE::Component::Client::HTTP. I like it, but it doesn't have all the useful features of WWW::Mechanize.

Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by Ultra (Hermit) on Dec 24, 2007 at 11:58 UTC

    Note that POE isn't about "threads": your program is basically a single process, and the work is split into atomic tasks called events, which should be non-blocking.

    For multiple events to coexist, your POE program should use POE Wheels or Components that are non-blocking (usually select-based).

    To run blocking pieces of code that aren't POE-aware, such as WWW::Mechanize, you may spawn a POE::Wheel::Run wheel, which is in fact another process, so that the rest of your application doesn't block.

    However, this alone wouldn't justify the use of POE, because you'd either spawn a new process for each request or use a pool of processes with one WWW::Mechanize client per process, each process fetching and parsing one page at a time.
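    A rough, untested sketch of that approach (the URLs and event names here are only placeholders I'm inventing for the example):

        use strict;
        use warnings;
        use POE qw(Wheel::Run);
        use WWW::Mechanize;

        my @urls = ('http://example.com/', 'http://example.org/');  # placeholders

        POE::Session->create(
            inline_states => {
                _start => sub {
                    my ($kernel, $heap) = @_[KERNEL, HEAP];
                    for my $url (@urls) {
                        # Each wheel forks a child process; the blocking fetch
                        # happens there, so the POE kernel keeps running.
                        my $wheel = POE::Wheel::Run->new(
                            Program => sub {
                                my $mech = WWW::Mechanize->new;
                                $mech->get($url);          # blocks only this child
                                print $mech->title, "\n";  # report back on stdout
                            },
                            StdoutEvent => 'child_stdout',
                            CloseEvent  => 'child_closed',
                        );
                        $heap->{wheels}{ $wheel->ID } = $wheel;   # keep the wheel alive
                        $kernel->sig_child($wheel->PID, 'child_reaped');
                    }
                },
                child_stdout => sub { print "child says: $_[ARG0]\n" },
                child_closed => sub { delete $_[HEAP]{wheels}{ $_[ARG0] } },
                child_reaped => sub { },  # reap children so the kernel can exit
            },
        );

        POE::Kernel->run;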

    POE::Component::Client::HTTP isn't as powerful as WWW::Mechanize, but you might use it to fetch the page contents and then have another session (in a separate process or not) do the processing of each page's content.

    POE Cookbook - Web Client shows a simple usage of POE::Component::Client::HTTP
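    For reference, the cookbook recipe boils down to something like this (again untested as written here; the alias, event name, and URL are placeholders):

        use strict;
        use warnings;
        use POE qw(Component::Client::HTTP);
        use HTTP::Request::Common qw(GET);

        # One shared, non-blocking user agent for the whole program.
        POE::Component::Client::HTTP->spawn(Alias => 'ua', Timeout => 60);

        POE::Session->create(
            inline_states => {
                _start => sub {
                    # Post a request; 'got_response' fires when it completes,
                    # and other sessions keep running in the meantime.
                    $_[KERNEL]->post('ua', 'request', 'got_response',
                                     GET 'http://example.com/');
                },
                got_response => sub {
                    my ($request_packet, $response_packet) = @_[ARG0, ARG1];
                    my $response = $response_packet->[0];   # an HTTP::Response
                    print $response->code, ': ',
                          length($response->content), " bytes\n";
                },
            },
        );

        POE::Kernel->run;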

    Dodge This!
Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by lestrrat (Deacon) on Dec 24, 2007 at 11:25 UTC

    Using WWW::Mechanize from within POE doesn't give you any advantage from using POE, because WWW::Mechanize internally uses LWP::UserAgent, which blocks.

    Note that POE isn't a multi-process framework. It's an event-based asynchronous framework. If you want to take advantage of POE, you need to use libraries that know about its asynchronous nature.

      Thanks for the reply. Yes, I was afraid of that, but POE::Component::Client::HTTP is very basic in functionality; that was the problem...
Re: POE&WWW::Mechanize or POE&POE::Component::Client::HTTP
by pc88mxer (Vicar) on Dec 24, 2007 at 22:41 UTC
    Here's my 2 cents -- ++ this node if you like the idea.

    In my experience, making a data-munging operation "multi-threaded" is not the problem. The real problems with these programs are operational in nature: making them robust in the presence of errors, making them restartable, being able to easily add or modify functionality, and being able to test parts of the process in isolation.

    The paradigm I've used over and over again is something I'll just call the "work pool" approach. Basically you have a "database" that contains a list of all the tasks that need to be performed. Then you have worker processes which acquire tasks, perform them, and mark them as completed. As part of performing a task, a worker process can add additional tasks to the database (or "work pool").

    This, of course, is not a new idea. Programs like sendmail operate in this fashion using the file system for the work pool. You can also use a real database (like mysql) or a persistent hash implementation like GDBM or even a commercial offering like mqseries.

    The advantages of structuring your application like this are numerous. For starters, you can make your application "multi-threaded" by using ordinary processes. Starting and stopping your application is now possible, since the state of your application is persistently stored. Also, if you, say, make each worker process task-specific, it is very easy to control which tasks get performed. Tasks which can't be performed due to resource unavailability (e.g. remote web site not available, not enough local disk space, etc.) can just be put back in the work pool to be executed later.

    In the past I've just rolled my own work pool implementation from scratch. Sometimes I used the file system, other times I used a database. This is such a useful and common pattern that it would be very helpful to have a framework (specifically a perl framework) for implementing work pools.
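    As a concrete illustration, a bare-bones version of the idea might look like the sketch below. This is only a sketch: the SQLite backend, the table layout, and the status values are choices I'm making up for the example, and the claim logic is only as race-safe as the underlying database makes it.

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=workpool.db', '', '',
                               { RaiseError => 1, AutoCommit => 1 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS tasks (
                id     INTEGER PRIMARY KEY AUTOINCREMENT,
                url    TEXT NOT NULL,
                status TEXT NOT NULL DEFAULT 'pending'  -- pending/running/done/failed
            )
        });

        # Grab one pending task and mark it as running; if another worker
        # claims it first, just move on to the next pending task.
        sub claim_task {
            while (my $row = $dbh->selectrow_hashref(
                       q{SELECT id, url FROM tasks WHERE status = 'pending' LIMIT 1})) {
                my $n = $dbh->do(
                    q{UPDATE tasks SET status = 'running'
                      WHERE id = ? AND status = 'pending'}, undef, $row->{id});
                return $row if $n > 0;
            }
            return;
        }

        sub finish_task {
            my ($id, $ok) = @_;
            $dbh->do(q{UPDATE tasks SET status = ? WHERE id = ?},
                     undef, $ok ? 'done' : 'failed', $id);
        }

        sub process_url {
            my ($url) = @_;
            print "processing $url\n";   # the real fetching/parsing goes here
        }

        # Each worker is just an ordinary process running this loop; failed
        # tasks stay marked 'failed' and can be reset to 'pending' later.
        while (my $task = claim_task()) {
            my $ok = eval { process_url($task->{url}); 1 };
            finish_task($task->{id}, $ok);
        }

    Run several copies of this worker in parallel and they will share the pool; tasks added by one worker simply become new 'pending' rows for any of them to pick up.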

    Hope this helps.

        Swarmage is being re-written as we speak, so don't use it. However, if you just want a crawler framework, check out Gungho (but it's not really a WWW::Mechanize type of framework).