First, if you're going to pay attention to meta tag keywords, you'd better be crawling only your own site. In general, meta information is not just useless, it's actively deceptive. Trusting it is generally worse than a waste of time: you'll end up with a database full of lies. Sad, but true.
Second, if you're crawling pages that aren't yours, you'd better obey the robot rules. Use LWP::RobotUA here, with a reasonable delay between requests. The default one-minute delay is fine; dropping it as low as 10 or 20 seconds between requests is probably OK too. (I'd leave it at the minute delay, personally.)
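For what it's worth, here's a minimal sketch of that setup. The agent name and contact address are placeholders I made up, and the URLs are simply taken from the command line. Note that LWP::RobotUA's delay() is given in minutes, not seconds.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder identity: substitute your own robot name and contact address.
my $ua = LWP::RobotUA->new(
    agent => 'my-crawler/0.1',
    from  => 'webmaster@example.com',
);

# delay() is measured in minutes; 1 is the default, set explicitly here.
$ua->delay(1);
# $ua->delay(20 / 60);   # 20 seconds, if you really want to go faster

# Fetch each URL named on the command line. The agent reads robots.txt
# for each host and sleeps between requests to the same host.
for my $url (@ARGV) {
    my $res = $ua->get($url);
    if ($res->is_success) {
        printf "%s: %d bytes\n", $url, length $res->content;
    }
    else {
        # Pages disallowed by robots.txt come back as a 403 response.
        printf "%s: %s\n", $url, $res->status_line;
    }
}
```

Run it as something like `perl crawl.pl http://www.example.com/` (the script name is hypothetical). With no arguments it does nothing.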
If you're not going to use LWP::RobotUA, and are crawling other people's pages, then you'd darned well better make sure you space out the requests. (Your ISP may want you to do this for your own pages, too. Snagging a couple of thousand pages over a cable modem or other reasonably high-bandwidth connection can be pretty harsh.)
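If you do roll your own, spacing out requests can be as simple as a sleep between fetches. This is only a rough sketch with plain LWP::UserAgent: the 60-second pause and the agent name are my assumptions, and unlike LWP::RobotUA it does not look at robots.txt at all, so that part is still on you.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Assumed pause of 60 seconds between requests; tune to taste, but be polite.
my $pause = 60;

# Placeholder agent name: substitute your own.
my $ua = LWP::UserAgent->new(agent => 'my-crawler/0.1');

# URLs come from the command line.
my @urls = @ARGV;
while (defined(my $url = shift @urls)) {
    my $res = $ua->get($url);
    printf "%s: %s\n", $url, $res->status_line;

    # Wait before the next request, but not after the last one.
    sleep $pause if @urls;
}
```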
If you're going to do it, then do it right, be polite, and respect the "Keep off the Grass" signs.
In reply to Re: Site Crawler by Elian, in thread Site Crawler by Frisbeeman