in reply to Google indexes Perlmonks

Bad news for the people who were wanting to prevent Google from indexing this site.

I am new here and missed that particular discussion. Why do we want to prevent Google from indexing this site?

Re^2: Google indexes Perlmonks
by Aristotle (Chancellor) on Dec 24, 2004 at 03:02 UTC

    Mostly because it would include nodelets, such as random bits of chatterbox conversation, which may turn up hits unrelated to what one was searching for. For indexing, thepen spiders the site and builds an offline archive of Perlmonks for consumption by Google. Try googling perlmonks site:thepen.com; that will bring up past discussion on this topic, too.

    Makeshifts last the longest.

      Considering that it isn't hard to determine that a request comes from a Google webcrawler, and that it's already possible to render pages with or without nodelets based on a user's profile, it wouldn't be too hard to serve Google nodelet-free pages, would it?
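
      For instance, something along these lines (a minimal sketch of the idea, not anything the site actually runs; render_page and its show_nodelets flag are invented for illustration):

          #!/usr/bin/perl
          # Hypothetical sketch: spot Googlebot by its User-Agent header
          # and render the page without nodelets for it.
          use strict;
          use warnings;
          use CGI;

          my $q  = CGI->new;
          my $ua = $q->user_agent() || '';

          # Googlebot announces itself in the User-Agent header.
          my $is_crawler = $ua =~ /Googlebot/i;

          print $q->header('text/html');
          print render_page( show_nodelets => !$is_crawler );

          # Stand-in for the real page renderer.
          sub render_page {
              my %opt = @_;
              my $html = "<html><body><p>node content ...</p>";
              $html .= "<div class='nodelet'>Chatterbox, etc.</div>"
                  if $opt{show_nodelets};
              return $html . "</body></html>";
          }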

        I think the problem is that it's not unlikely that a web crawler ends up hammering the system by following way too many links, generating zillions of unnecessary page fetches. For instance, the spider lands on the front page, which links to the high-activity sections along with a large number of root-level nodes. It then follows each section and each front-paged node. Each of those nodes has links to itself, so the crawler won't just index the root node and the thread below it, but the whole thing and then each reply singly; quite possibly it will do this twice for the front-paged nodes. Then it will also index each user's home node, which of course leads to lists of nodes written by that author, which I imagine will eventually result in Google single-handedly fetching pretty close to each and every node on the site. This is load we just don't need.

        Of course, the CB and various other bits that we don't really want indexed are also a reason. But I should think the core reason is that our site isn't particularly amenable to automated crawlers. The whole point of blakem's static mirror is that it is static, updated rarely, and generated at a low load threshold. Once it's mirrored, Google can search and index it as it likes; there won't be unnecessary load on our DB servers, so we don't really care at that point.
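
        To make that concrete, a low-load mirror boils down to something like the sketch below. This is a rough illustration only, not blakem's actual code; the node-id range, the mirror/ directory, and the delay are all made up.

            #!/usr/bin/perl
            # Hypothetical sketch of a slow static mirror: fetch nodes one
            # at a time, write each out as a flat file, and throttle hard.
            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new( agent => 'pm-mirror/0.1' );
            mkdir 'mirror' unless -d 'mirror';

            for my $id ( 1 .. 1000 ) {    # id range made up for illustration
                my $res = $ua->get("http://perlmonks.org/?node_id=$id");
                next unless $res->is_success;

                open my $fh, '>', "mirror/$id.html"
                    or die "mirror/$id.html: $!";
                print {$fh} $res->content;
                close $fh;

                sleep 5;    # crawl slowly so the DB servers never feel it
            }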

        ---
        demerphq

        No, but would it also be less work than thepen? Maybe. Maybe not.

        Makeshifts last the longest.

Re^2: Google indexes Perlmonks
by bart (Canon) on Dec 24, 2004 at 07:55 UTC
    I think that's mostly because it captures the Chatterbox nodelet. Searching Google for a username usually brought up a few shreds of Chatterbox conversation. Uh, like it does now, too. (It probably won't in the future.) Just search for a username that often appears in the Chatterbox, and view the cached result.

    I'm not sure what the line

    # sorry, but misbehaved robots have ruined it for all of you.
    
    means, exactly. I was hoping the person responsible for that line, or for the final decision to block indexing, would have put in his two cents here. Maybe he still will.
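
    For what it's worth, the syntax suggests that line is a comment from the site's robots.txt. A blanket-deny file of that sort would look roughly like this (a guess at the surrounding file, not necessarily what perlmonks.org actually serves):

        # sorry, but misbehaved robots have ruined it for all of you.
        User-agent: *
        Disallow: /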