in reply to Creating a web crawler (theory)

Don't forget to save URLs you've visited in a lookup table (e.g., a hash). Don't revisit a URL you've already been to, at least not if you were there recently.
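A minimal sketch in Python, with a dict standing in for the hash (the 24-hour revisit window is an arbitrary choice, tune it to taste):

```python
import time

REVISIT_AFTER = 24 * 60 * 60  # assumed revisit window: one day, in seconds

visited = {}  # url -> timestamp of the last visit

def should_visit(url):
    """True if we've never fetched this URL, or not recently."""
    last = visited.get(url)
    return last is None or time.time() - last >= REVISIT_AFTER

def mark_visited(url):
    visited[url] = time.time()
```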

This is simple, but so is the rage of a sysadmin whose site is being crawled in a loop. :)

Use a User-Agent: header that allows admins to contact you should they need to.
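For instance, with Python's urllib (the bot name, contact URL, and email are placeholders; substitute your own):

```python
import urllib.request

# A descriptive agent string with contact details so an admin can reach you.
headers = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot; admin@example.com)"}

req = urllib.request.Request("https://example.com/", headers=headers)
with urllib.request.urlopen(req) as resp:
    page = resp.read()
```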

Oh, and honor robots.txt, yes?
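Python's standard library will even parse it for you; a sketch, assuming the same placeholder site and bot name as above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the site's robots.txt

# Ask before every fetch, using your crawler's User-Agent string.
if rp.can_fetch("ExampleBot/1.0", "https://example.com/some/page.html"):
    pass  # safe to request the page
```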