in reply to •Re: Web Site Mapper
in thread Web Site Mapper

Perhaps I overlooked something in WWW::Robot's documentation, but it doesn't appear to quite fit what he wanted to do: as he said in the update, for his purposes he needed to ignore robots.txt rules, and I couldn't see any way to turn that behaviour off.

And on the second point, it appears that the sub will return (almost) immediately from a link to a page it has already hit, so it should stop once it has exhausted all pages it hasn't already indexed and finished returning up through what is likely a rather long list of links it has already visited.

Perhaps it might be more efficient to check whether a link has already been visited before invoking the sub from the foreach?
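
To show what I mean, here is a minimal sketch of that variant; it is not the original script, and %seen, spider(), and links_on() are made-up names standing in for whatever the real code uses:

    use strict;
    use warnings;

    my %seen;    # pages we have already indexed

    sub links_on {
        my ($url) = @_;
        # Stub for illustration: a real version would fetch $url
        # (e.g. with LWP) and extract its hrefs.
        return ();
    }

    sub spider {
        my ($url) = @_;
        $seen{$url} = 1;
        foreach my $link ( links_on($url) ) {
            next if $seen{$link};    # skip before the call, not inside it
            spider($link);
        }
    }

    spider('http://example.com/');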

Replies are listed 'Best First'.
Re: Re: •Re: Web Site Mapper
by hardburn (Abbot) on Feb 16, 2004 at 14:24 UTC

    Perhaps it might be more efficient to check whether a link has already been visited before invoking the sub from the foreach?

    More efficient, perhaps. However, in my case I needed to search a number of virtual hosts which may end up linking to each other's front pages. If you check whether a page has already been visited before invoking the spider sub, that check also has to be duplicated in the initial call. By doing the check at the beginning of the sub, there is only one place where it takes place.
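
    For concreteness, a minimal sketch of the check-at-the-top approach (again with hypothetical names, not the posted script):

        use strict;
        use warnings;

        my %seen;

        sub links_on { return () }    # stub; a real version fetches and parses the page

        sub spider {
            my ($url) = @_;
            return if $seen{$url}++;    # one guard covers recursion *and* the initial calls
            spider($_) for links_on($url);
        }

        # Several virtual hosts may link to each other's front pages;
        # no extra "have we seen this?" code is needed out here.
        spider("http://$_/") for qw(www.example.com shop.example.com);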

    In any case, I doubt the efficiency advantage matters. This program isn't CPU-intensive. It will more likely be limited by your connection to the target domains and (probably to a lesser extent) by the memory overhead of recursive calls and the data structure.

    Update: I did notice that CPU usage spikes at the end of the program, when YAML::Dump is called. But that's only done once. (I was worried for a moment that it had managed to get itself stuck in an infinite loop and started eating massive system resources *g*).

    ----
    : () { :|:& };:

    Note: All code is untested, unless otherwise stated