in reply to Re^2: Async DNS with LWP
in thread Async DNS with LWP

WWW::Mechanize doesn't actually do any socket work. It lets LWP do it, so nothing needs to be done. Keep in mind that Coro is cooperative multitasking, so your sockets can't receive anything while your crawler is busy doing something other than waiting for data.

Re^4: Async DNS with LWP
by jc (Acolyte) on Oct 05, 2010 at 22:13 UTC
    Sounds great! So, we just replace the LWP module with your AnyEvent::HTTP / Coro version and things should work for Mechanize out of the box? Not sure I see what you mean by your point about Coro. If my crawler isn't spending any time waiting for data I will be extremely happy that it is crawling as fast as my network connection allows.
      Not sure I see what you mean by your point about Coro.

      Coro isn't threaded! (Despite the blatant lies in the documentation!).

      It is cooperative task-switching--like Windows 3.1--which means that if one of your coro instances is busy, none of the others will do anything at all until it either: finishes; goes into a wait for IO; or yields.

      It also means that regardless of how many cores you have, it will only ever use one of them.
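
A minimal sketch of the behaviour described above, assuming the Coro module from CPAN is installed. The names `run_demo`, `$polite`, "busy" and "quick" are made up for illustration and are not from the thread; "busy" stands in for a coro doing CPU-bound work, and "quick" for one that just wants its turn (for instance, to read a socket).

```perl
use strict;
use warnings;
use Coro;   # assumption: Coro is installed from CPAN

# Run a "busy" coro alongside a "quick" one and record the order of events.
# $polite (hypothetical flag) controls whether the busy coro yields between steps.
sub run_demo {
    my ($polite) = @_;
    my @log;

    my $busy = async {
        for my $step (1 .. 3) {
            push @log, "busy $step";   # stands in for CPU-bound work, e.g. parsing a page
            cede() if $polite;         # voluntarily hand control back to the scheduler
        }
    };

    my $quick = async {
        push @log, "quick";            # stands in for a coro waiting for its turn
    };

    # join blocks the main coro, which lets the scheduler run the other two.
    $_->join for $busy, $quick;
    return @log;
}

print "no cede:   ", join(", ", run_demo(0)), "\n";
print "with cede: ", join(", ", run_demo(1)), "\n";
```

Without the `cede`, "quick" should not get a look-in until "busy" has finished all three steps; with it, "quick" should run as soon as "busy" first yields. Either way everything runs on a single core, one coro at a time.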


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        OK,

        so at this point I'm now thinking:

        * LWP and Mechanize are nice toys for making a quick proof of concept of a real web crawler, but in practice they are not useful for anything more than low-bandwidth automated tasks.

        * With AnyEvent::HTTP and Coro you can make a proof of concept that performs better, but you're still not quite there.

        * To build a genuinely high-performance parallel web crawler that makes the best use of network resources, with parallel asynchronous DNS lookups and parallel HTTP requests, I either need to use Perl's bloated thread model and work directly with Perl's UDP and TCP interfaces, or give up on Perl and build this in C.

        It really seems a shame that there are so many Perl modules dedicated to crawling tasks, and yet none of them has really proved up to the job of being the back end of a high-performance crawler that makes the best use of network resources. The fact that people have dedicated so much time to writing such modules suggests that many Perl users have an interest in web crawling. I'm wondering (new to PerlMonks, please help me out here) whether there's anything we can do to set up a team of Perl developers who can improve the situation and develop easy-to-use Perl modules that are up to the job?