I found the site The Web Robots Pages useful for sysadmins and would-be Web robot programmers who don't happen to know what robots.txt is.

Does anyone know how common or uncommon robots.txt is?

Or would anyone like to share any Dos and Don'ts about writing a Web robot (or anything that programmatically fetches something for you via the Web)?

It seems rather common for people not to set the "agent" (whose default value is "libwww-perl/#.##") when using LWP::UserAgent. Whether that matters depends on the sites your script or robot visits.
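For reference, a minimal sketch of setting the agent explicitly. The robot name, info URL and contact address are made-up placeholders, not anything a particular site requires:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical robot name and contact details; substitute your own.
my $ua = LWP::UserAgent->new(
    agent   => 'ExampleBot/0.1 (+http://www.example.com/bot.html)',
    from    => 'webmaster@example.com',   # sent as the From: request header
    timeout => 30,
);

print "User-Agent: ", $ua->agent, "\n";
print "From: ", $ua->from, "\n";

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $response = $ua->get($url);
    print $response->status_line, "\n";
}
```

Run without arguments it only prints the two header values; pass a URL to make one request with them.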

Re: Web Robot
by Abigail-II (Bishop) on Jul 16, 2003 at 22:22 UTC
    For me, there's just one "don't":
    • Don't piss off the owners of the site.

    From that rule, many others can be deduced:

    • Obey robots.txt.
    • Don't flood a site.
    • Don't republish, especially not anything that might be copyrighted.
    • Be very conservative when visiting sites that maintain themselves by showing ads. Anytime you fetch something without fetching the ad(s), it costs them money, without any gain for them.
    • For anything you need to register for, don't do anything that conflicts with their terms of service.

    Remember that your robot will be a guest in other people's territory. Act accordingly.

    Abigail

      You'll note that most major search engines violate the majority, if not all, of these rules. So don't take them too seriously.

      Don't let that stop you from playing nice though :)

        Most major search engines don't stick to those rules, true. So if you use a robots.txt on your site, be aware that it may be ignored.

        But I think that just because some big companies/search engines don't stick to the rules doesn't mean you should do the same. I always go by the maxim: don't do unto someone else what you wouldn't want done to you or your site.

        Just my 2 Rappen (Swiss equivalent to cents).

        --cs

        There are nights when the wolves are silent and only the moon howls. - George Carlin

        The question is ... does this really matter? I mean, if your pages are indexed, they are more likely to be found, so you get more hits, more ad views and, in the end, more money. So I would not care that much if a search engine flooded my server with requests once a month.

        So IMHO the only violation that might matter is not obeying robots.txt. Actually, could someone give me an example of reasonable robots.txt usage?

        Jenda
        Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
           -- Rick Osborne


Re: Web Robot Exclusion
by simonm (Vicar) on Jul 17, 2003 at 03:15 UTC
    Robots.txt files only appear on a minority of sites, but should always be respected. A good first step is to start with LWP::RobotUA (or LWP::Parallel::RobotUA), which will implement the Robots Exclusion Standard for you.
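A minimal sketch of what that looks like, assuming a made-up robot name and contact address. As far as I know, LWP::RobotUA fetches and caches each host's robots.txt itself, and answers a disallowed URL with a 403 response rather than making the request:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical robot name and contact address; LWP::RobotUA needs both.
my $ua = LWP::RobotUA->new(
    agent => 'ExampleBot/0.1',
    from  => 'webmaster@example.com',
);

# Minimum wait between requests to the same host, in minutes.
$ua->delay(1);

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $response = $ua->get($url);
    print $response->status_line, "\n";
}
```

The delay is given in minutes, not seconds, so 10/60 would mean ten seconds between hits on one host.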