When search engines crawl the site (good), they pick up the text in the chatterbox (bad). If you do a Google search like

http://www.google.com/search?q=hellohiothellobr

You get a few PerlMonks hits (this was a nonsense test string for a Perl discussion in the chatterbox).

If you search on a nickname that was active in the chatterbox when the bot came through, however...

http://www.google.com/search?q=jackdied

You get a bunch (Google thankfully removes the redundant ones). Try it with any nick and you'll get at least a few hits.

Other than knowing that a robots.txt file exists, I don't know how specific you can be when telling a bot how to behave. Is there an easy way to change the chatterbox behavior without reducing the functionality?
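
As far as I can tell, robots.txt only lets you allow or disallow whole URLs, so it can't hide just the chatterbox portion of a page. Something like this (the path is made up purely for illustration, and the per-page meta tag is about as fine-grained as it gets):

    # robots.txt -- hypothetical path, just to show the granularity available
    User-agent: *
    Disallow: /chat-only-page

    # Some crawlers (Google included) also honor a per-page tag:
    # <meta name="robots" content="noindex">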

-jack

PS: and yes, I was doing a vanity search on Google when I discovered this behavior.

(ichimunki) Re: chatterbox & search engines
by ichimunki (Priest) on Oct 16, 2001 at 18:47 UTC
    We could conceal the Chatterbox nodelet from unregistered monks, which would have the effect of concealing it from robots. This is done at E2. However, this community is not E2, and we want to entice newbies into our midst; making it obvious to them that there is a useful chat feature means the Chatterbox nodelet really needs to stay visible to unregistered visitors.

    We might be able to play a guessing game with the User-Agent header, and turn off the Chatterbox nodelet for any UAs that match a list of known spiders and robots. Since Google is our primary concern, that one should be easy to track down -- it's called Googlebot, I think, and the UA string should be fairly evident in the Apache logs.
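
    A quick Perl sketch of what that check might look like (render_chatterbox() and the exact bot patterns are made up on my part, not actual Everything Engine code):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical: skip the Chatterbox nodelet for known crawlers.
        my @bot_patterns = (
            qr/Googlebot/i,
            qr/Slurp/i,        # Inktomi
            qr/ia_archiver/i,  # Alexa / the Internet Archive
        );

        # Stand-in for however the Everything Engine actually emits the nodelet.
        sub render_chatterbox { print "<!-- chatterbox nodelet HTML -->\n" }

        my $ua = $ENV{HTTP_USER_AGENT} || '';
        my $is_bot = grep { $ua =~ $_ } @bot_patterns;

        render_chatterbox() unless $is_bot;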
      Watching the UA header, and even maintaining a list of known search engine IP addresses, is fairly common in the porn industry as a way to disguise the layout of a page that makes it to the top listing for particular keywords.

      It's generally used by people who don't want someone to come along and build a similar page that would have the same high ranking. For that goal, though, the UA header is useless, since it obviously can be spoofed.
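
      For comparison, a sketch of the IP-list variant (the addresses are placeholders from a documentation range, not real crawler netblocks):

          use strict;
          use warnings;

          # Hypothetical: decide by source IP instead of the spoofable UA header.
          # 192.0.2.x is a documentation range, not a real crawler netblock.
          my %known_spider_ip = map { $_ => 1 } qw(
              192.0.2.10
              192.0.2.11
          );
          my $addr = $ENV{REMOTE_ADDR} || '';
          print $known_spider_ip{$addr} ? "spider\n" : "human (probably)\n";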

      Anyway, some search engines despise this practice (called "cloaking"), since some people will get "XXX Hardcore Fuxor Pictures" listed on the keyword "Baseball" or something that equally undermines the effectiveness of the search engine. These engines (and I wish I had a current list) don't care if you have a legitimate reason for cloaking, and will ban you anyway.

      More rational search engines (a group I'm sure Google belongs in) will tolerate cloaking that doesn't impact how well they help people find sites that match their keywords.

      Check out IP Delivery, a poorly written Perl program for cloaking pages that sells for an absurd amount of money.
        This is not really a case of cloaking; it's an attempt to remove temporal data from a static cache. The only other ways to do that would be to remove the temporal data for non-registered clients, or to include a no-cache directive in robots.txt. Both are unacceptable: the former means AM can't see chat, and the latter makes all of PM non-cached. Since the primary "offender" is Google, I'd simply look at their UA string and serve a different page based on that. But I'm not an EE hacker, so I simply offer this as a "nice to have" to the development team.

        I have to wonder how well cloaking detection even works without human intervention... you can't simply compare the HTML from one GET to the next: the site could be using the UA to send tuned HTML, or could have a random feature, or any number of other things that result in slightly dissimilar HTML. As such, it would almost have to undergo human review, or some similarity testing that PM, with or without chatter, would probably pass.
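
        To make "similarity testing" concrete, here's a toy sketch of the kind of check I mean -- score the overlap between the word sets of two fetches of the same page; the 0.9 cutoff is arbitrary:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Toy similarity test: Dice coefficient over the word sets of two pages.
            sub similarity {
                my ($x, $y) = @_;
                my (%in_x, %in_y);
                $in_x{$_}++ for grep { length } split /\W+/, lc $x;
                $in_y{$_}++ for grep { length } split /\W+/, lc $y;
                my $common = grep { $in_y{$_} } keys %in_x;
                my $total  = keys(%in_x) + keys(%in_y);
                return $total ? 2 * $common / $total : 1;
            }

            # Two fetches of the "same" page, one with chatter and one without.
            my $with_chatter    = "<p>Perl is fun</p> <p>jack says hi in the chatterbox</p>";
            my $without_chatter = "<p>Perl is fun</p>";

            my $score = similarity($with_chatter, $without_chatter);
            printf "similarity %.2f: %s\n", $score,
                $score > 0.9 ? "probably the same page" : "noticeably different";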
Re: chatterbox & search engines
by chipmunk (Parson) on Oct 16, 2001 at 20:44 UTC
Re: chatterbox & search engines
by merlyn (Sage) on Oct 17, 2001 at 04:59 UTC