in reply to Stateful Browsing and link extraction with AnyEvent::HTTP
in thread Async DNS with LWP

I think it shouldn't be too hard to push the results into a WWW::Mechanize object when they become available. WWW::Mechanize will then do the cookie extraction etc.; if you're using raw LWP, you're extracting the cookies yourself anyway. You then need to override/capture the request that WWW::Mechanize (or LWP) generates when you ->get or follow a link, and hand that request off to AnyEvent::HTTP again.
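A minimal sketch of the capture side (untested; the subclass name My::AnyEventMech is made up, but simple_request is the LWP::UserAgent method that every request funnels through, and WWW::Mechanize inherits it):

    package My::AnyEventMech;
    use strict;
    use warnings;
    use parent 'WWW::Mechanize';
    use AnyEvent;
    use AnyEvent::HTTP;
    use HTTP::Response;

    # Every ->get(), ->follow_link() etc. ends up in simple_request(),
    # so intercepting it here reroutes all traffic through AnyEvent::HTTP.
    sub simple_request {
        my ( $self, $request ) = @_;
        my $done = AnyEvent->condvar;

        http_request(
            $request->method => $request->uri->as_string,
            headers => {
                map { lc $_ => scalar $request->header($_) }
                    $request->header_field_names
            },
            body    => $request->content,
            recurse => 0,    # let LWP/Mechanize handle redirects itself
            sub {
                my ( $body, $hdr ) = @_;
                my $res = HTTP::Response->new( $hdr->{Status}, $hdr->{Reason} );
                # AnyEvent::HTTP lowercases the real headers; its pseudo
                # headers (Status, Reason, URL, ...) start with a capital
                $res->header( $_ => $hdr->{$_} ) for grep { !/^[A-Z]/ } keys %$hdr;
                $res->content( defined $body ? $body : '' );
                $res->request($request);    # HTTP::Cookies needs this later
                $done->send($res);
            },
        );

        # Blocking here keeps WWW::Mechanize happy but gains nothing yet;
        # the interesting part is firing off many requests before ->recv.
        return $done->recv;
    }

    1;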

I'm not sure that it makes much sense to rewrite WWW::Mechanize to be based on AnyEvent::HTTP, because you will need asynchronicity all over the place anyway.

You could look into spawning threads or simply spawning external processes to handle your requests, but if you're already looking into asynchronous resolvers, you're either prematurely optimizing the wrong end of the task or the overhead from launching threads or processes will eat into your time/latency budget.

Re^2: Stateful Browsing and link extraction with AnyEvent::HTTP
by jc (Acolyte) on Oct 05, 2010 at 09:28 UTC
    Hi, currently I have a multi-process web crawler based on LWP::UserAgent and Parallel::ForkManager. Spawning more parallel processes only seems to yield speedup up to about 10 processes. I wanted to make this faster in serial before worrying about scaling to larger numbers of processes/threads. Asynchronous DNS and HTTP certainly seems the obvious way to go to me (unless there is anything else you can suggest). I'm not sure I follow how you can get Mechanize to use AnyEvent::HTTP results. Is there any chance of a quick little code snippet that illustrates what you mean? Thanks.

      If you are currently using LWP::UserAgent, you get an HTTP::Response back. If you are using AnyEvent::HTTP, you get the data and the headers back. From the data and the headers, you can either (re)construct an HTTP::Response or work with them directly.

      If you want to handle cookies, see HTTP::Cookies which can extract cookies from a HTTP::Message.
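      For example (a sketch; $response is an HTTP::Response rebuilt from AnyEvent::HTTP's data and headers with its ->request set, and $next_request is whatever request you send next):

          use HTTP::Cookies;
          my $jar = HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 );

          # pull the Set-Cookie headers out of the rebuilt response
          $jar->extract_cookies($response);

          # and stamp the matching Cookie header onto the next request
          $jar->add_cookie_header($next_request);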

      If you want to (re)use WWW::Mechanize, see its ->update_html method and, for header handling and general message handling, the source of its ->request method.
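      For the link-extraction part, a sketch of the ->update_html route (untested, URLs made up; one ordinary ->get primes the object first, since update_html reparses relative to the current page):

          use strict;
          use warnings;
          use AnyEvent;
          use AnyEvent::HTTP;
          use WWW::Mechanize;

          my $mech = WWW::Mechanize->new;
          $mech->get('http://example.com/');    # prime the object once

          # fetch the next page with AnyEvent::HTTP instead
          my $done = AnyEvent->condvar;
          http_request GET => 'http://example.com/next', sub { $done->send( $_[0] ) };

          # hand the raw HTML over; WWW::Mechanize reparses links and forms
          $mech->update_html( $done->recv );
          print $_->url, "\n" for $mech->links;

      Relative links may still resolve against the primed page's base, so check ->base if you rely on ->url_abs.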

        Hi, I've been trying to get parallel DNS working with AnyEvent::DNS. My first poor attempt is effectively the same as synchronous DNS because the program waits for the reply before moving on to the next iteration of the loop:
            #!/usr/bin/perl -w
            use strict;
            use warnings;
            use AnyEvent;
            use AnyEvent::DNS;

            my $resolver = AnyEvent::DNS::resolver;

            while (my $domain = <>) {
                # clean off newline
                chomp $domain;
                # send dns packets
                $resolver->resolve( $domain, "*", my $condvar = AnyEvent->condvar );
                # receive dns packets (blocks right away)
                $condvar->recv;
                print "$domain\n";
            }
        My second attempt was slightly better in that it can send off ten different DNS packets but receives a reply from the last request ten times (not what I wanted):
            #!/usr/bin/perl -w
            use strict;
            use warnings;
            use AnyEvent;
            use AnyEvent::DNS;

            my $domain;
            my @condvars;
            my $resolver = AnyEvent::DNS::resolver;

            while (1) {
                # send dns packets
                for my $i ( 1 .. 10 ) {
                    $domain = <>;
                    # clean off newline
                    chomp $domain;
                    $resolver->resolve( $domain, "*", my $condvar = AnyEvent->condvar );
                    push @condvars, $condvar;
                }
                # receive dns packets
                while ( my $condvar = pop @condvars ) {
                    $condvar->recv;
                    print "$domain\n";    # $domain still holds the last name read
                }
            }
        The problem is that $condvar seems to be working like a reference to an object, and so each of the 10 $condvars in the stack ends up being whatever the last $condvar was. I've tried typeglobbing but this just causes compilation errors. Does anybody know how to do this right?
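        For what it's worth, a sketch of one way out (untested): each `my $condvar` inside the loop is in fact a fresh condvar; what is shared is the single $domain variable, which by print time holds the last name read. Declaring $domain as a lexical inside the loop and letting a callback close over it keeps each name paired with its reply:

            #!/usr/bin/perl -w
            use strict;
            use warnings;
            use AnyEvent;
            use AnyEvent::DNS;

            my $resolver = AnyEvent::DNS::resolver;

            while (1) {
                my @condvars;
                # send dns packets
                for my $i ( 1 .. 10 ) {
                    defined( my $domain = <> ) or exit;
                    chomp $domain;    # a fresh lexical on every iteration
                    my $condvar = AnyEvent->condvar;
                    $resolver->resolve(
                        $domain, "*",
                        sub {
                            # this closure captured *its own* $domain
                            print "$domain\n";
                            $condvar->send;
                        }
                    );
                    push @condvars, $condvar;
                }
                # receive dns packets
                $_->recv for @condvars;
            }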