in reply to chatterbox & search engines

We could conceal the Chatterbox nodelet from unregistered monks, which would have the side effect of concealing it from robots. This is what E2 does. However, since this community is not E2 and we want to entice newbies into our midst, making it obvious to them that there is a useful chat feature means we should leave the Chatterbox nodelet visible to unregistered visitors.

We might be able to play a guessing game with the User-Agent header, and turn off the Chatterbox nodelet for any UAs that match a list of known spiders and robots. Since Google is our primary concern, that one should be easy to track down-- it's called Googlebot, I think, and the UA string should be fairly evident in the Apache logs.
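A minimal sketch of that guessing game, assuming a CGI-style environment where the client's UA arrives in $ENV{HTTP_USER_AGENT}. The pattern list is illustrative only; the real strings would need to be confirmed against the Apache logs.

```perl
#!/usr/bin/perl
use strict;

# Illustrative patterns only -- confirm the actual UA strings in the logs.
my @bot_patterns = (
    qr/Googlebot/i,      # Google's crawler
    qr/Slurp/i,          # Inktomi
    qr/ia_archiver/i,    # Alexa / Internet Archive
);

# Return true if the given User-Agent string matches a known robot.
sub is_robot {
    my ($ua) = @_;
    return 0 unless defined $ua;
    for my $pat (@bot_patterns) {
        return 1 if $ua =~ $pat;
    }
    return 0;
}

# The nodelet would then be rendered only for non-robots, e.g.:
# render_chatterbox() unless is_robot($ENV{HTTP_USER_AGENT});
```

(render_chatterbox is a hypothetical hook; whatever actually emits the nodelet would go there.)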

Re: chatterbox & search engines (cloaking)
by Hutta (Scribe) on Oct 16, 2001 at 20:40 UTC
    Watching the UA header, and even maintaining a list of known search engine IP addresses, is fairly common in the porn industry to disguise the layout of a page that makes it to the top listing on particular keywords.

    It's generally used by people who don't want someone to come along and build a similar page that would have the same high ranking. The UA header is useless with that goal in mind, since it obviously can be spoofed.
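    For comparison, the IP-based check -- the variant that can't be spoofed so easily -- might look like this sketch. The addresses are documentation placeholders (192.0.2.0/24), since a real list would have to be collected from observed crawler traffic and kept current.

```perl
#!/usr/bin/perl
use strict;

# Placeholder addresses -- a real list would be built from observed
# crawler traffic and maintained as the engines change their netblocks.
my %crawler_ip = map { $_ => 1 } qw(
    192.0.2.10
    192.0.2.11
);

# Return true if the client IP is on the known-crawler list.
sub is_known_crawler {
    my ($ip) = @_;
    return (defined $ip && $crawler_ip{$ip}) ? 1 : 0;
}

# In a CGI context the client address is in $ENV{REMOTE_ADDR}:
# serve_cloaked_page() if is_known_crawler($ENV{REMOTE_ADDR});
```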

    Anyway, some search engines despise this practice (called "cloaking"), since some people will get "XXX Hardcore Fuxor Pictures" listed on the keyword "Baseball" or something that equally undermines the effectiveness of the search engine. These engines (and I wish I had a current list) don't care if you have a legitimate reason for cloaking, and will ban you anyway.

    More rational search engines (a group I'm sure Google belongs in) will tolerate cloaking that doesn't impact how well they help people find sites that match their keywords.

    Check out IP Delivery, a poorly written perl program that sells for an absurd amount of money to cloak pages.
      This is not really a case of cloaking; it's an attempt to keep temporal data out of a static cache. The only other ways to do that would be to remove the temporal data for non-registered clients, or to shut robots out via robots.txt (which has no cache-control directive of its own, only exclusion rules). Both are unacceptable: the former means AM can't see chat, and the latter makes all of PM non-cached. Since the primary "offender" is Google, I'd simply look at their UA string and give them a different page based on that. But I'm not an EE hacker, so I simply offer this as a "nice to have" to the development team.
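      For concreteness, the robots.txt option ruled out above would look like the fragment below. It works by keeping crawlers away from pages entirely, and since the Chatterbox is embedded in otherwise useful pages rather than living at its own path, there's no way to exclude just the chatter -- hence it takes all of PM out of the index and cache.

```
# robots.txt -- excludes every compliant crawler from the whole site
User-agent: *
Disallow: /
```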

      I have to wonder how well cloaking-detection even works without human intervention... you can't simply compare the HTML from one GET to the next: the site could be using the UA to send tuned HTML, or could have a random feature, or any number of other things resulting in slightly dissimilar HTML. As such, it would almost have to undergo human review, or some similarity testing that PM, with or without Chatter, would probably pass.