dlarochelle has asked for the wisdom of the Perl Monks concerning the following question:

I'm putting together a test suite for a web crawler.

In the initial version of the test suite, we hosted files on a publicly accessible web site and hard-coded its URL into the test cases; the tests had the crawler spider that URL and download and process the files it found.

We recently had to stop running the web server that our tests directed the crawler at, which breaks those tests. I'd like to find a way to test the crawler without having to publicly host files. Can anyone suggest an alternative approach?

My initial thought is that the test suite could somehow start up a fake web server for the tests to download from, but I couldn't find anyone describing how to do that.

Any thoughts?

Thanks in advance.

Re: Testing a web crawler
by BrowserUk (Patriarch) on Mar 22, 2010 at 21:43 UTC

    You could bundle a real web server with your code; a suitable one is about 53k, configured via the command line and legally unrestricted.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Testing a web crawler
by ikegami (Patriarch) on Mar 22, 2010 at 20:29 UTC
    WWW::Mechanize's test suite uses HTTP::Daemon. Mind you, the latter doesn't test cleanly on Windows, but it works fine for testing WWW::Mechanize.
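
    A minimal sketch of how that might look, in case it helps: start HTTP::Daemon on an ephemeral port, serve a couple of canned pages from a forked child, and point the crawler at $d->url from the test. Only the HTTP::Daemon and Test::More parts are real APIs; the %pages hash and the commented-out MyCrawler calls are placeholders for whatever your crawler actually exposes.

    use strict;
    use warnings;

    use HTTP::Daemon;
    use HTTP::Response;
    use HTTP::Status qw(RC_NOT_FOUND);
    use Test::More;

    # Canned documents for the crawler to spider.
    my %pages = (
        '/'           => '<html><body><a href="/page1.html">page 1</a></body></html>',
        '/page1.html' => '<html><body>hello from page 1</body></html>',
    );

    # LocalPort => 0 lets the OS pick a free port, so nothing is hard coded.
    my $d = HTTP::Daemon->new( LocalAddr => '127.0.0.1', LocalPort => 0 )
        or die "Cannot start test server: $!";
    my $base_url = $d->url;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: answer requests until the parent kills us.
        while ( my $c = $d->accept ) {
            while ( my $r = $c->get_request ) {
                if ( exists $pages{ $r->uri->path } ) {
                    $c->send_response(
                        HTTP::Response->new(
                            200, 'OK',
                            [ 'Content-Type' => 'text/html' ],
                            $pages{ $r->uri->path },
                        )
                    );
                }
                else {
                    $c->send_error(RC_NOT_FOUND);
                }
            }
            $c->close;
        }
        exit 0;
    }

    # Parent: run the crawler against the local server and check the results.
    # (MyCrawler is a stand-in for the real crawler module.)
    # my @fetched = MyCrawler->new->crawl($base_url);
    # is( scalar @fetched, 2, 'crawler fetched both test pages' );
    ok( defined $base_url, "test server is listening at $base_url" );

    kill 'TERM', $pid;
    waitpid $pid, 0;
    done_testing();

    Because the base URL comes from $d->url, the tests no longer depend on any externally hosted files.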

      I ended up hacking something together using HTTP::Daemon. This worked, though I think there should be a more turnkey approach.

      Thanks for the suggestions

Re: Testing a web crawler
by pemungkah (Priest) on Mar 23, 2010 at 09:33 UTC
    I submitted an article to the Perl Journal (or was it the Perl Review...?) a while back about using Mojolicious to do this. Never did hear anything more. Hm. Well.

    It boils down to it being really easy to make Mojo respond any way you want to a given URL. You have your spider "visit" the Mojo URL and get back a page with a bunch of links on it, then test all the different kinds of things that could happen (timeout, 404, 500, you name it) by following those appropriately-crafted URLs, all of which happen to be on that first page you crawl. You also need one "that's all folks" URL to make the Mojolicious server go away, but that's easy enough to do.
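
    For what it's worth, a rough Mojolicious::Lite sketch of that kind of test server might look like this. The route names (/ok, /missing, /broken, /slow, /shutdown) are invented for illustration; the idea is just that everything the crawler should exercise is linked from the front page, and /shutdown is the "that's all folks" URL.

    use Mojolicious::Lite;
    use Mojo::IOLoop;

    # Front page: every URL the crawler should exercise is linked from here.
    get '/' => sub {
        my $c = shift;
        $c->render( text => join "\n",
            '<html><body>',
            '<a href="/ok">a normal page</a>',
            '<a href="/missing">a 404</a>',
            '<a href="/broken">a 500</a>',
            '<a href="/slow">a page that never answers</a>',
            '</body></html>',
        );
    };

    get '/ok'      => sub { shift->render( text => '<html><body>all fine here</body></html>' ) };
    get '/missing' => sub { shift->render( text => 'not here', status => 404 ) };
    get '/broken'  => sub { shift->render( text => 'boom',     status => 500 ) };

    # Never renders a response, so the crawler's timeout should fire first
    # (assuming it is shorter than the server's own inactivity timeout,
    # which would otherwise just close the connection).
    get '/slow' => sub { shift->render_later };

    # The "that's all folks" URL: the test suite requests this when it is done;
    # the short timer lets the response go out before the event loop stops.
    get '/shutdown' => sub {
        my $c = shift;
        $c->render( text => 'bye' );
        Mojo::IOLoop->timer( 0.5 => sub { Mojo::IOLoop->stop } );
    };

    app->start;

    Started with something like "perl test_server.pl daemon -l http://127.0.0.1:3000", the spider gets pointed at http://127.0.0.1:3000/ and the test suite requests /shutdown when it has finished.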

      pemungkah,

      Have you thought about turning your article into a blog post? I looked at Mojolicious, but it seemed complicated, so a guide to doing something like this would be useful. It would be a shame if the community couldn't benefit from the work you put into writing the article.