Bad news for the people who wanted to prevent Google from indexing this site: apparently Google still indexes it via http://qs321.pair.com/~monkads/. Try it: perl site:qs321.pair.com

This URL probably sneaks past the rules set up in robots.txt because it's in a subdirectory, not in the root.

A slightly modified rule may need to be added to the robots.txt file in the root of that domain.
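Since robots.txt rules are matched per hostname and must live at the root of each host, a rule on perlmonks.org has no effect on qs321.pair.com. A sketch of what the additional file might look like (the exact path to disallow is an assumption based on the URL above):

```
# Hypothetical /robots.txt at the root of qs321.pair.com.
# Rules only apply to the host they are served from, so the
# subdirectory mirror must be disallowed here explicitly.
User-agent: *
Disallow: /~monkads/
```

Note that well-behaved crawlers fetch /robots.txt from each hostname separately before crawling it.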

Replies are listed 'Best First'.
Re: Google indexes Perlmonks
by demerphq (Chancellor) on Dec 23, 2004 at 13:15 UTC

    Interesting. I think you are right. Somebody out there linked via the server name, and Google found it. I think you are correct that both servers need a robots.txt in their root directory too.

    ---
    demerphq

Re: Google indexes Perlmonks
by Thilosophy (Curate) on Dec 24, 2004 at 01:04 UTC
    Bad news for the people who were wanting to prevent Google from indexing this site.

    I am new here and missed that particular discussion. Why do we want to prevent Google from indexing this site?

      Mostly because it would include nodelets, such as random bits of chatterbox conversation. That may turn up hits unrelated to what one was searching for. For indexing, thepen spiders the site and builds an offline archive of Perlmonks for consumption by Google. Try google perlmonks site:thepen.com — that will bring up past discussion on this topic too.

      Makeshifts last the longest.

        Considering that it isn't hard to determine that a request comes from a Google web crawler, and that it's already possible to render pages with or without nodelets based on a user profile, it wouldn't be too hard to serve Google nodelet-free pages, would it?
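A minimal sketch of the idea, assuming a CGI-style environment; the `Googlebot` pattern matches Google's published crawler User-Agent strings, but the `render_page` interface is purely hypothetical, not PerlMonks' actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if the User-Agent string looks like Google's crawler.
sub is_google_crawler {
    my ($ua) = @_;
    return defined($ua) && $ua =~ /Googlebot/i;
}

# In a CGI handler one might then write (render_page is hypothetical):
#   my $show_nodelets = !is_google_crawler($ENV{HTTP_USER_AGENT});
#   render_page(nodelets => $show_nodelets);
```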
      I think that's mostly because it captures the Chatterbox nodelet. Searching Google for a username usually brought up a few shreds of Chatterbox conversation. Uh, like it does now, too. (It probably won't in the future.) Just search for a username that often appears in the Chatterbox, and view the cached result.

      I'm not sure what the line

      # sorry, but misbehaved robots have ruined it for all of you.
      
      means, exactly. I was hoping the person responsible for that line, or for the final decision to block indexing, would have put in his two cents here. Maybe he still will.