in reply to Async DNS with LWP

AnyEvent::DNS is an asynchronous resolver. I guess you can resolve the IPs using AnyEvent::DNS and then use LWP, but if you're using AnyEvent(::DNS) already, I would stay asynchronous and use AnyEvent::HTTP to do the HTTP requests.
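To illustrate, here is a minimal sketch of fetching several URLs concurrently with AnyEvent::HTTP (the URLs are placeholders; AnyEvent::HTTP resolves hostnames via AnyEvent::DNS under the hood, so no blocking lookups happen):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = ('http://www.example.com/', 'http://www.example.org/');

my $cv = AnyEvent->condvar;
$cv->begin for @urls;              # one "begin" per outstanding request

for my $url (@urls) {
    http_get $url, sub {
        my ($body, $headers) = @_;
        print "$url -> $headers->{Status} $headers->{Reason}\n";
        $cv->end;                  # this request is done
    };
}

$cv->recv;                         # run the event loop until all requests finish
```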

Stateful Browsing and link extraction with AnyEvent::HTTP
by jc (Acolyte) on Oct 05, 2010 at 08:26 UTC
    Hi Corion, thanks for your advice. I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS). Now AnyEvent::HTTP uses AnyEvent::DNS out of the box, so using AnyEvent::HTTP sounds like good advice. However, I'm wondering if this is now going to create more problems than it solves. Using AnyEvent::HTTP implies implementing explicit logic to make the browser stateful and to handle cookies and Referer headers correctly. It also implies complications in producing output in a form that HTTP::LinkExtor can parse to extract links. Has anybody ever got a stateful web crawler based on AnyEvent::HTTP working?
      I had thought about using AnyEvent::DNS but there don't seem to be any obvious ways of getting LWP to use its results rather than doing its own synchronous resolution (via the OS).

      Surely, if you resolve the domain name yourself (asynchronously or not), and then supply the resolved dotted decimal as part of the url you supply to LWP, it won't have to, or be able to, do the resolution again itself?

      (I know; you don't like being called Shirley:)
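      Something like this sketch, using AnyEvent::DNS for the lookup and URI to splice the address into the URL (the URL is a placeholder):

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::DNS;
use URI;

my $url  = URI->new('http://www.example.com/index.html');
my $host = $url->host;

# Asynchronous A-record lookup; block here only for the sake of the example
my $cv = AnyEvent->condvar;
AnyEvent::DNS::a $host, sub { $cv->send(@_) };
my ($ip) = $cv->recv
    or die "could not resolve $host";

$url->host($ip);    # LWP will now connect straight to the IP, no DNS lookup
print "fetch $url (original host: $host)\n";
```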


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I think it's not that simple because LWP needs to generate the appropriate Host: header. But if you generate that header in your code, I think you should be able to reuse most of what LWP already supplies.

        Alternatively, maybe you can even supply the appropriate opened (blocking) socket to LWP after having made an (asynchronous) connection to the host. But that involves source diving, surely.
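        A sketch of the Host: header idea: request the pre-resolved IP directly, but send the original hostname in the Host header so name-based virtual hosts still route correctly. The IP and hostname here are hypothetical; substitute the result of your own (asynchronous) lookup:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ip   = '192.0.2.1';           # hypothetical pre-resolved address
my $host = 'www.example.com';     # the name the server expects

my $req = HTTP::Request->new(GET => "http://$ip/");
$req->header(Host => $host);      # what the server uses to pick the vhost

my $ua  = LWP::UserAgent->new;
my $res = $ua->request($req);
print $res->status_line, "\n";
```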

        The web server may have multiple named virtual servers on the same IP, in which case it will check the Host header passed in with the request to determine which webroot to use.
        It will work easily for non-authenticated requests, but not for authenticated ones.

      I think it shouldn't be too hard to push the results into a WWW::Mechanize object when they are available. WWW::Mechanize will then do the cookie extraction etc. and if you're using raw LWP, you're extracting the cookies yourself anyway. You then need to override/capture the request that WWW::Mechanize (or LWP) generates when you ->get or follow a link. This request is then again handed off to AnyEvent::HTTP.

      I'm not sure that it makes much sense to rewrite WWW::Mechanize to be based on AnyEvent::HTTP, because you will need asynchronicity all over the place anyway.
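      As a rough sketch of the hand-off (not a full Mechanize integration): wrap the AnyEvent::HTTP result in an HTTP::Response, and the usual LWP-family modules such as HTTP::Cookies and HTTP::LinkExtor can then work on it. The URL is a placeholder, and the lowercase-key filter relies on AnyEvent::HTTP returning real headers as lowercase keys and pseudo-headers (Status, Reason, URL) capitalized:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;
use HTTP::Request;
use HTTP::Response;
use HTTP::Cookies;
use HTTP::LinkExtor;

my $jar = HTTP::Cookies->new;     # the crawler's state
my $url = 'http://www.example.com/';

my $cv = AnyEvent->condvar;
http_get $url, sub {
    my ($body, $hdr) = @_;

    # Rebuild an HTTP::Response from AnyEvent::HTTP's header hash
    my $res = HTTP::Response->new($hdr->{Status}, $hdr->{Reason});
    $res->push_header($_ => $hdr->{$_}) for grep { !/^[A-Z]/ } keys %$hdr;
    $res->content($body);
    $res->request(HTTP::Request->new(GET => $url));  # cookie code needs the request URL

    $jar->extract_cookies($res);  # statefulness: collect cookies as LWP would

    my @links;
    HTTP::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a';
    }, $url)->parse($body);

    print "got ", scalar @links, " links\n";
    $cv->send;
};
$cv->recv;
```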

      You could look into spawning threads or simply spawning external processes to handle your requests, but if you're already looking into asynchronous resolvers, you're either prematurely optimizing the wrong end of the task or the overhead from launching threads or processes will eat into your time/latency budget.

        Hi, currently I have a multi-process web crawler based on LWP::UserAgent and Parallel::ForkManager. Spawning off more parallel processes only seems to give a speedup up to about 10 processes. I wanted to get this working faster in serial before worrying about scaling to larger numbers of processes/threads. Asynchronous DNS and HTTP certainly seem the obvious way to go to me (unless there is anything else you can suggest). I'm not sure I follow how you can get Mechanize to use AnyEvent::HTTP results. Is there any chance of a quick little code snippet that illustrates what you mean? thanks.