dlarochelle has asked for the wisdom of the Perl Monks concerning the following question:

I'm putting together a test suite for a web crawler.

In the initial version of the test suite, we hosted files on a publicly accessible web site and hard-coded its URL into the test cases; the tests had the crawler spider that URL and download and process the files it found.

We recently had to stop running the web server that our tests directed the crawler at, which breaks those tests. I'd like to find a way to test the crawler without having to publicly host files. Can anyone suggest an alternative approach?

My initial thought is that the test suite could somehow start up a fake web server for the tests to download from, but I couldn't find anyone describing how to do that.

Any thoughts?

Thanks in advance.

Re: Testing a web crawler
by BrowserUk (Patriarch) on Mar 22, 2010 at 21:43 UTC

    You could bundle a real web server with your code; a suitable one is about 53k, configured via the command line and legally unrestricted.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Testing a web crawler
by ikegami (Patriarch) on Mar 22, 2010 at 20:29 UTC
    WWW::Mechanize's test suite uses HTTP::Daemon. Mind you, the latter doesn't test cleanly on Windows, but it works fine for testing WWW::Mechanize.
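
    A minimal sketch of how that might look, in case it helps: start HTTP::Daemon on an ephemeral port, serve a couple of canned pages from a forked child, and point the crawler at $d->url from the test. Only the HTTP::Daemon and Test::More parts are real APIs; the %pages hash and the commented-out MyCrawler calls are placeholders for whatever your crawler actually exposes.

    use strict;
    use warnings;

    use HTTP::Daemon;
    use HTTP::Response;
    use HTTP::Status qw(RC_NOT_FOUND);
    use Test::More;

    # Canned documents for the crawler to spider.
    my %pages = (
        '/'           => '<html><body><a href="/page1.html">page 1</a></body></html>',
        '/page1.html' => '<html><body>hello from page 1</body></html>',
    );

    # LocalPort => 0 lets the OS pick a free port, so nothing is hard coded.
    my $d = HTTP::Daemon->new( LocalAddr => '127.0.0.1', LocalPort => 0 )
        or die "Cannot start test server: $!";
    my $base_url = $d->url;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: answer requests until the parent kills us.
        while ( my $c = $d->accept ) {
            while ( my $r = $c->get_request ) {
                if ( exists $pages{ $r->uri->path } ) {
                    $c->send_response(
                        HTTP::Response->new(
                            200, 'OK',
                            [ 'Content-Type' => 'text/html' ],
                            $pages{ $r->uri->path },
                        )
                    );
                }
                else {
                    $c->send_error(RC_NOT_FOUND);
                }
            }
            $c->close;
        }
        exit 0;
    }

    # Parent: run the crawler against the local server and check the results.
    # (MyCrawler is a stand-in for the real crawler module.)
    # my @fetched = MyCrawler->new->crawl($base_url);
    # is( scalar @fetched, 2, 'crawler fetched both test pages' );
    ok( defined $base_url, "test server is listening at $base_url" );

    kill 'TERM', $pid;
    waitpid $pid, 0;
    done_testing();

    Because the base URL comes from $d->url, the tests no longer depend on any externally hosted files.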

      I ended up hacking something together using HTTP::Daemon. This worked, though I think there should be a more turnkey approach.

      Thanks for the suggestions

Re: Testing a web crawler
by pemungkah (Priest) on Mar 23, 2010 at 09:33 UTC
    I submitted an article to the Perl Journal (or was it the Perl Review...?) a while back about using Mojolicious to do this. Never did hear anything more. Hm. Well.

    It boils down to it being really easy to make Mojo respond any way you want to a given URL. You have your spider "visit" the Mojo URL and get back a page with a bunch of links on it, then test all the different kinds of things that could happen (timeout, 404, 500, you name it) by following those appropriately-crafted URLs, all of which happen to be on that first page you crawl. You also need one "that's all folks" URL to make the Mojolicious server go away, but that's easy enough to do.
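
    For what it's worth, a rough Mojolicious::Lite sketch of that kind of test server might look like this. The route names (/ok, /missing, /broken, /slow, /shutdown) are invented for illustration; the idea is just that everything the crawler should exercise is linked from the front page, and /shutdown is the "that's all folks" URL.

    use Mojolicious::Lite;
    use Mojo::IOLoop;

    # Front page: every URL the crawler should exercise is linked from here.
    get '/' => sub {
        my $c = shift;
        $c->render( text => join "\n",
            '<html><body>',
            '<a href="/ok">a normal page</a>',
            '<a href="/missing">a 404</a>',
            '<a href="/broken">a 500</a>',
            '<a href="/slow">a page that never answers</a>',
            '</body></html>',
        );
    };

    get '/ok'      => sub { shift->render( text => '<html><body>all fine here</body></html>' ) };
    get '/missing' => sub { shift->render( text => 'not here', status => 404 ) };
    get '/broken'  => sub { shift->render( text => 'boom',     status => 500 ) };

    # Never renders a response, so the crawler's timeout should fire first
    # (assuming it is shorter than the server's own inactivity timeout,
    # which would otherwise just close the connection).
    get '/slow' => sub { shift->render_later };

    # The "that's all folks" URL: the test suite requests this when it is done;
    # the short timer lets the response go out before the event loop stops.
    get '/shutdown' => sub {
        my $c = shift;
        $c->render( text => 'bye' );
        Mojo::IOLoop->timer( 0.5 => sub { Mojo::IOLoop->stop } );
    };

    app->start;

    Started with something like "perl test_server.pl daemon -l http://127.0.0.1:3000", the spider gets pointed at http://127.0.0.1:3000/ and the test suite requests /shutdown when it has finished.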

      pemungkah,

      Have you thought about turning your article into a blog post? I looked at Mojolicious, but it seemed complicated, so a guide to doing something like this would be useful. It would be a shame if the community couldn't benefit from the work you put into writing the article.