in reply to Re: Stateful Browsing and link extraction with AnyEvent::HTTP
in thread Async DNS with LWP

Hi, currently I have a multi-process web crawler based on LWP::UserAgent and Parallel::ForkManager. Spawning more parallel processes only seems to give a speedup up to about 10 processors. I want to get this working faster in serial before worrying about scaling to larger numbers of processors or threads. Asynchronous DNS and HTTP certainly seem the obvious way to go (unless there is anything else you can suggest). I'm not sure I follow how you can get Mechanize to use AnyEvent::HTTP results. Is there any chance of a quick little code snippet that illustrates what you mean? Thanks.

Re^3: Stateful Browsing and link extraction with AnyEvent::HTTP
by Corion (Patriarch) on Oct 05, 2010 at 09:35 UTC

    If you are currently using LWP::UserAgent, you get an HTTP::Response back. If you are using AnyEvent::HTTP, you get the data and the headers back. From the data and the headers, you can either (re)construct an HTTP::Response or do your own stuff with them directly.
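As a rough sketch of that reconstruction step: AnyEvent::HTTP passes its callback the body and a hash reference of headers, where real header names are lowercased and pseudo-headers such as Status, Reason and URL start with an uppercase letter. The helper name response_from_anyevent is made up for illustration, and attaching a GET request is an assumption about how the page was fetched.

```perl
use strict;
use warnings;
use HTTP::Headers;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: turn the ($body, $hdr) pair that an AnyEvent::HTTP
# callback receives into an HTTP::Response object.
sub response_from_anyevent {
    my ($body, $hdr) = @_;
    my %h = %$hdr;    # copy, so the caller's hash stays untouched

    # Pseudo-headers carry the status line and the final URL.
    my $status = delete $h{Status};
    my $reason = delete $h{Reason};
    my $url    = delete $h{URL};

    # Drop any remaining pseudo-headers (they start with an uppercase letter).
    delete @h{ grep { /^[A-Z]/ } keys %h };

    my $res = HTTP::Response->new($status, $reason, HTTP::Headers->new(%h), $body);

    # Assumption: the page was fetched with GET. Modules such as
    # HTTP::Cookies look at the attached request to find the URL.
    $res->request(HTTP::Request->new(GET => $url)) if defined $url;

    return $res;
}
```

Inside an http_get callback this would be called as response_from_anyevent(@_). One caveat, as far as I know: AnyEvent::HTTP joins repeated headers with a comma, so several Set-Cookie headers in one reply may need extra care.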

    If you want to handle cookies, see HTTP::Cookies which can extract cookies from a HTTP::Message.
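A minimal sketch of the cookie handling, assuming you already have an HTTP::Response with its request attached (extract_cookies needs the request to know which URL the cookies belong to). The URLs and the cookie value are placeholders, not anything from the crawler in question.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

my $jar = HTTP::Cookies->new;

# Placeholder request/response pair standing in for a real fetch.
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
my $res = HTTP::Response->new(200, 'OK', [ 'Set-Cookie' => 'session=abc123; path=/' ]);
$res->request($req);

# Store whatever cookies the response sets.
$jar->extract_cookies($res);

# Let the jar add a Cookie header to the next request for the same site.
my $next = HTTP::Request->new(GET => 'http://www.example.com/page2');
$jar->add_cookie_header($next);

print $next->header('Cookie'), "\n";
```

With AnyEvent::HTTP, the Cookie header value from $next would then have to be passed along in the headers option of the following request.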

    If you want to (re)use WWW::Mechanize, see its ->update_html method and, for header handling and general message handling, (the source of) its ->request method.

      Hi, I've been trying to get parallel DNS working with AnyEvent::DNS. My first poor attempt is effectively the same as synchronous DNS, because the program waits for the reply before moving on to the next iteration of the loop:
      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use AnyEvent;
      use AnyEvent::DNS;

      my ($domain);
      my (@domains, @condvars);
      my $resolver = AnyEvent::DNS::resolver;

      while ($domain = <>) {
          # clean off newline
          chomp $domain;

          # send dns packets
          $resolver->resolve($domain, "*", my $condvar = AnyEvent->condvar);

          # receive dns packets (blocks until this one query is answered)
          $condvar->recv;
          print "$domain\n";
      }
      My second attempt was slightly better in that it sends off ten different DNS packets, but it seems to receive the reply to the last request ten times (not what I wanted):
      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use AnyEvent;
      use AnyEvent::DNS;

      my ($domain);
      my (@domains, @condvars);
      my $resolver = AnyEvent::DNS::resolver;

      while (1) {
          # send dns packets
          for my $i (1..10) {
              $domain = <>;
              # clean off newline
              chomp $domain;
              $resolver->resolve($domain, "*", my $condvar = AnyEvent->condvar);
              push @condvars, $condvar;
          }

          # receive dns packets
          while (my $condvar = pop @condvars) {
              $condvar->recv;
              print "$domain\n";
          }
      }
      The problem is that $condvar seems to work like a reference to an object, so each of the 10 $condvars on the stack ends up being whatever the last $condvar was. I've tried typeglobbing, but this just causes compilation errors. Does anybody know how to do this right?

        You get the same "one domain" ten times because you print $domain ten times, and by then it holds the last line read. What value would you expect $domain to have at that point?
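One way to keep the two in step, sketched under some assumptions: store each domain together with its own condvar, so the name is still known when the reply arrives. The function name resolve_batch and the $start_query argument are made up for illustration. $start_query is any code reference that starts a lookup and hands the condvar over as the callback (condvars can be called like code references, which is the same as calling send on them).

```perl
use strict;
use warnings;
use AnyEvent;

# Hypothetical helper: start one query per domain, then collect the replies.
# $start_query->($domain, $condvar) must arrange for the condvar to be
# sent the result, e.g. by passing it as the callback of a resolver call.
sub resolve_batch {
    my ($start_query, @domains) = @_;

    my @pending;
    for my $domain (@domains) {
        my $cv = AnyEvent->condvar;       # a fresh condvar per query
        $start_query->($domain, $cv);
        push @pending, [ $domain, $cv ];  # remember which domain it belongs to
    }

    # Wait for each reply in turn; recv returns whatever was passed to send.
    return map {
        my ($domain, $cv) = @$_;
        [ $domain, $cv->recv ]
    } @pending;
}
```

With the resolver from the post, the first argument could be something like sub { my ($domain, $cv) = @_; $resolver->resolve($domain, "*", $cv) }. Each element of the returned list then starts with the domain, followed by whatever the resolver passed back for it.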

        Looking at the documentation of AnyEvent::DNS, the SYNOPSIS section shows how to retrieve results from a query. I'm not sure what problems you have with that.

        If you want to do a callback-driven approach, take a look at the AnyEvent::DNS::srv subroutine. Again, the usage seems quite straightforward, as the resolved IPs get passed as parameters.
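For the callback-driven route, here is a minimal sketch that uses a single condvar with begin/end to wait for a whole batch. It uses AnyEvent::DNS::a (address lookup) rather than srv, purely as an assumption about what the crawler needs. The function name resolve_all and the optional $lookup parameter are made up for illustration, the latter so that a different lookup function can be swapped in.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::DNS;

# Hypothetical helper: resolve all domains in parallel and return a hash
# reference mapping each domain to an array reference of results.
sub resolve_all {
    my ($domains, $lookup) = @_;
    $lookup ||= \&AnyEvent::DNS::a;   # called as $lookup->($domain, $callback)

    my %addrs;
    my $done = AnyEvent->condvar;

    $done->begin;                     # guard, so an empty list does not hang
    for my $domain (@$domains) {      # $domain is a fresh variable per iteration
        $done->begin;
        $lookup->($domain, sub {
            $addrs{$domain} = [@_];   # the closure remembers its own $domain
            $done->end;
        });
    }
    $done->end;                       # release the guard

    $done->recv;                      # returns once every callback has run
    return \%addrs;
}
```

Reading the names from standard input first (chomp(my @domains = <>)) and calling resolve_all(\@domains) would then send all queries before waiting for any reply. Because each callback closes over its own $domain, the results cannot get mixed up the way they do when a single shared $domain is printed after the fact.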