in reply to My Crawler script

Definitely agree - you're going way too low-level and making lots of extra work for yourself.

Notes:

  1. Make sure you obey robots.txt. libwww-perl will give you the necessary tools for this.
  2. Make sure your crawler is polite and doesn't hammer the site to death, fetching pages as fast as it can. Add a short sleep() - even 1 second - between pages.
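Both points are handled at once by LWP::RobotUA, which ships with libwww-perl: it fetches and honors each site's robots.txt automatically, and it enforces a minimum delay between requests to the same host. A minimal sketch — the agent name, contact address, and URL below are placeholders:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Identify your crawler and give a contact address (both placeholders here),
# so site operators can reach you instead of just blocking you.
my $ua = LWP::RobotUA->new('MyCrawler/0.1', 'me@example.com');

# delay() is measured in *minutes*: 1/60 of a minute = 1 second between
# requests to the same server.
$ua->delay(1/60);

my $res = $ua->get('http://example.com/');
if ($res->is_success) {
    print $res->decoded_content;
}
else {
    # URLs disallowed by robots.txt come back as "403 Forbidden by robots.txt".
    warn $res->status_line, "\n";
}
```

The delay applies per server, so crawling several sites in parallel is still reasonably fast while each individual site only sees one request per second.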

Re^2: My Crawler script
by Sary (Novice) on Mar 13, 2011 at 11:56 UTC
    Please explain note #1. What's robots? Gee, who's the genius who voted -1? No one is forced to help me or even read my posts. Direct replies or references for research are both appreciated, but simple and sometimes dumb questions will be asked anyway...
      robots.txt is an agreed-upon standard (see this site for lots of details) for limiting access to websites, specifically for crawlers.

      It defines

      • who is allowed to crawl the site
      • what paths they may or may not crawl at that site

      The robots.txt file is very important, as it keeps you from crawling links that could cause problems at the remote site, either by consuming large amounts of resources (e.g., an "add to shopping cart" link; following all of these on a site could generate a very large shopping cart indeed!) or by causing actual problems (e.g., a "delete" link or "report spam" link).

      Your crawler should read the robots.txt and follow its strictures - including skipping the site altogether if you see

      User-agent: *
      Disallow: /
      or a "disallow" that specifies your particular user agent.
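      If you want to check the rules yourself rather than letting LWP::RobotUA do it for you, WWW::RobotRules (also part of libwww-perl) parses a robots.txt you have fetched and answers per-URL queries. A small sketch — the agent name, URLs, and inlined robots.txt are made up for illustration:

```perl
use strict;
use warnings;
use WWW::RobotRules;

# The agent name is a placeholder; use your crawler's real name so
# agent-specific Disallow rules match correctly.
my $rules = WWW::RobotRules->new('MyCrawler/0.1');

# Normally you would fetch this with LWP from http://site/robots.txt;
# it is inlined here to keep the example self-contained.
my $robots_txt = "User-agent: *\nDisallow: /\n";
$rules->parse('http://example.com/robots.txt', $robots_txt);

if ($rules->allowed('http://example.com/page.html')) {
    print "ok to crawl\n";
}
else {
    print "disallowed - skip this site\n";
}
```

      With the catch-all Disallow above, allowed() returns false for every URL on the host, which is exactly the "skip the site altogether" case.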

      I should note that some sites are a bit weird about who crawls them; at Blekko we had a certain site that wasn't sure they agreed with us on some philosophical points, to put it kindly, and they specifically blocked our crawler. This can happen, and it's important to be polite and follow the robots.txt directives to keep site operators from taking more aggressive action, like blocking your IP (or worse, your entire IP block).

      (Edit: updated last sentence to clarify it slightly.)