Are you sure that dynamic content isn't going to bite you? It gets pretty tricky without customizing the spider for each dynamic site... For instance, are you going to spider each of the 100,000 pages of the form:
http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1
http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1234
http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=12345678
etc...
Unless your bot is smart enough to know that perlmonks pages are "loosely unique" on node_id alone, you could wind up in a nearly infinite loop. Just looking at the node_id/lastnode_id combo, there are something like 100,000 * 100,000 (ten billion) URL combinations.
Of course, if you have a custom spidering routine for each domain, you can put enough brains into your spider to avoid this problem... but that severely limits the number of domains you can index.
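If it helps, here's the sort of thing I mean, as a minimal (untested) sketch using the URI module: a per-domain table of query params that don't affect page content, stripped before the "have I seen this URL?" check. Treating lastnode_id as ignorable is my own assumption based on the URLs above, not something your bot necessarily knows.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;

    # Per-domain strip lists: query params believed not to change
    # page content. (lastnode_id here is my assumption.)
    my %strip_params = (
        'www.perlmonks.org' => { lastnode_id => 1 },
    );

    my %seen;  # canonical URL => times seen

    # Reduce a URL to a canonical form so near-duplicates collapse.
    sub canonicalize {
        my ($url) = @_;
        my $uri   = URI->new($url);
        my $strip = $strip_params{ lc $uri->host }
            or return $uri->canonical->as_string;
        my %q = $uri->query_form;
        delete @q{ grep { $strip->{$_} } keys %q };
        if (%q) {
            # Sort the surviving params so ordering doesn't matter.
            $uri->query_form( map { $_ => $q{$_} } sort keys %q );
        } else {
            $uri->query(undef);  # no params left; drop the query
        }
        return $uri->canonical->as_string;
    }

    # Only crawl a URL the first time its canonical form shows up.
    sub should_crawl {
        my ($url) = @_;
        return !$seen{ canonicalize($url) }++;
    }

    # Both URLs below collapse to ...?node_id=83485, so the
    # second is skipped:
    for my $u (
        'http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1',
        'http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1234',
    ) {
        print should_crawl($u) ? "crawl: $u\n" : "skip:  $u\n";
    }

The win is that the per-domain knowledge shrinks to one line of config per site instead of a whole custom routine... though you still have to work out the strip list for each domain by hand, which is the hard part.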
I've done a similar thing over at Making perlmonks search engine friendly if that helps.
-Blake