in reply to Async DNS with LWP

AnyEvent::DNS is an asynchronous resolver. I guess you can resolve the IPs using AnyEvent::DNS and then use LWP, but if you're using AnyEvent(::DNS) already, I would stay asynchronous and use AnyEvent::HTTP to do the HTTP requests.
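To illustrate, here is a minimal sketch of fetching several URLs concurrently with AnyEvent::HTTP (the URLs are placeholders; AnyEvent::HTTP resolves hostnames via AnyEvent::DNS under the hood, so no blocking lookups happen):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = ('http://www.example.com/', 'http://www.example.org/');

my $cv = AnyEvent->condvar;
$cv->begin for @urls;              # one "begin" per outstanding request

for my $url (@urls) {
    http_get $url, sub {
        my ($body, $headers) = @_;
        print "$url -> $headers->{Status} $headers->{Reason}\n";
        $cv->end;                  # this request is done
    };
}

$cv->recv;                         # run the event loop until all requests finish
```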

Stateful Browsing and link extraction with AnyEvent::HTTP
by jc (Acolyte) on Oct 05, 2010 at 08:26 UTC
    Hi Corion, thanks for your advice. I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS). Now AnyEvent::HTTP uses AnyEvent::DNS out of the box, so using AnyEvent::HTTP sounds like good advice. However, I'm wondering if this is now going to create more problems than it solves. Using AnyEvent::HTTP implies implementing explicit logic to make the browser stateful and to handle cookies and Referer headers correctly. It also implies complications in producing output in a form that HTTP::LinkExtor can parse to extract links. Has anybody ever got a stateful web crawler based on AnyEvent::HTTP working?
      I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS).

      Surely, if you resolve the domain name yourself (asynchronously or not), and then supply the resolved dotted decimal as part of the url you supply to LWP, it won't have to, or be able to, do the resolution again itself?

      (I know; you don't like being called Shirley:)
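      Something like this sketch, using AnyEvent::DNS for the lookup and URI to splice the address into the URL (the URL is a placeholder):

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::DNS;
use URI;

my $url  = URI->new('http://www.example.com/index.html');
my $host = $url->host;

# Asynchronous A-record lookup; block here only for the sake of the example
my $cv = AnyEvent->condvar;
AnyEvent::DNS::a $host, sub { $cv->send(@_) };
my ($ip) = $cv->recv
    or die "could not resolve $host";

$url->host($ip);    # LWP will now connect straight to the IP, no DNS lookup
print "fetch $url (original host: $host)\n";
```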


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I think it's not that simple because LWP needs to generate the appropriate Host: header. But if you generate that header in your code, I think you should be able to reuse most of what LWP already supplies.

        Alternatively, maybe you can even supply the appropriate opened (blocking) socket to LWP after having made an (asynchronous) connection to the host. But that involves source diving, surely.
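        A sketch of the Host: header idea: request the pre-resolved IP directly, but send the original hostname in the Host header so name-based virtual hosts still route correctly. The IP and hostname here are hypothetical; substitute the result of your own (asynchronous) lookup:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ip   = '192.0.2.1';           # hypothetical pre-resolved address
my $host = 'www.example.com';     # the name the server expects

my $req = HTTP::Request->new(GET => "http://$ip/");
$req->header(Host => $host);      # what the server uses to pick the vhost

my $ua  = LWP::UserAgent->new;
my $res = $ua->request($req);
print $res->status_line, "\n";
```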

        The web server may have multiple named virtual servers on the same IP, in which case it will check the Host header passed in with the request to determine which webroot to use.
        It will work easily for non-authenticated requests, but not for authenticated ones.

      I think it shouldn't be too hard to push the results into a WWW::Mechanize object when they are available. WWW::Mechanize will then do the cookie extraction etc. and if you're using raw LWP, you're extracting the cookies yourself anyway. You then need to override/capture the request that WWW::Mechanize (or LWP) generates when you ->get or follow a link. This request is then again handed off to AnyEvent::HTTP.

      I'm not sure that it makes much sense to rewrite WWW::Mechanize to be based on AnyEvent::HTTP, because you will need asynchronicity all over the place anyway.
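      As a rough sketch of the hand-off (not a full Mechanize integration): wrap the AnyEvent::HTTP result in an HTTP::Response, and the usual LWP-family modules such as HTTP::Cookies and HTTP::LinkExtor can then work on it. The URL is a placeholder, and the lowercase-key filter relies on AnyEvent::HTTP returning real headers as lowercase keys and pseudo-headers (Status, Reason, URL) capitalized:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;
use HTTP::Request;
use HTTP::Response;
use HTTP::Cookies;
use HTTP::LinkExtor;

my $jar = HTTP::Cookies->new;     # the crawler's state
my $url = 'http://www.example.com/';

my $cv = AnyEvent->condvar;
http_get $url, sub {
    my ($body, $hdr) = @_;

    # Rebuild an HTTP::Response from AnyEvent::HTTP's header hash
    my $res = HTTP::Response->new($hdr->{Status}, $hdr->{Reason});
    $res->push_header($_ => $hdr->{$_}) for grep { !/^[A-Z]/ } keys %$hdr;
    $res->content($body);
    $res->request(HTTP::Request->new(GET => $url));  # cookie code needs the request URL

    $jar->extract_cookies($res);  # statefulness: collect cookies as LWP would

    my @links;
    HTTP::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a';
    }, $url)->parse($body);

    print "got ", scalar @links, " links\n";
    $cv->send;
};
$cv->recv;
```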

      You could look into spawning threads or simply spawning external processes to handle your requests, but if you're already looking into asynchronous resolvers, you're either prematurely optimizing the wrong end of the task or the overhead from launching threads or processes will eat into your time/latency budget.

        Hi, currently I have a multi-process web crawler based on LWP::UserAgent and Parallel::ForkManager. Spawning off more parallel processes only seems to give a speedup up to about 10 processes. I wanted to get this working faster in serial before worrying about scaling to larger numbers of processes/threads. Asynchronous DNS and HTTP certainly seem the obvious way to go to me (unless there is anything else you can suggest). I'm not sure I follow how you can get Mechanize to use AnyEvent::HTTP results. Is there any chance of a quick little code snippet that illustrates what you mean? thanks.