When search engines crawl the site (good), they pick up the text in the chatterbox (bad). If you do a Google search like

http://www.google.com/search?q=hellohiothellobr

You get a few PerlMonks hits (this was a nonsense test string for a Perl discussion in the chatterbox).

If you search on a nickname that was active in the chatterbox when the bot came through, however...

http://www.google.com/search?q=jackdied

You get a bunch (Google thankfully removes the redundant ones). Try it with any nick and you'll get at least a few hits.

Other than knowing that a robots.txt file exists, I don't know how specific you can be when telling a bot how to behave. Is there an easy way to change the chatterbox behavior without reducing the functionality?
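
As far as I can tell, robots.txt only lets you allow or disallow whole URLs, so it can't hide just the chatterbox portion of a page. Something like this (the path is made up purely for illustration, and the per-page meta tag is about as fine-grained as it gets):

    # robots.txt -- hypothetical path, just to show the granularity available
    User-agent: *
    Disallow: /chat-only-page

    # Some crawlers (Google included) also honor a per-page tag:
    # <meta name="robots" content="noindex">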

-jack

PS: and yes, I was doing a vanity search on Google when I discovered this behavior.

(ichimunki) Re: chatterbox & search engines
by ichimunki (Priest) on Oct 16, 2001 at 18:47 UTC
    We could conceal the Chatterbox nodelet from unregistered monks, which would have the effect of concealing it from robots. This is done at E2. However, this community is not E2, and we want to entice newbies into our midst; making it obvious to them that there is a useful chat feature means the Chatterbox nodelet really needs to stay visible to unregistered visitors.

    We might be able to play a guessing game with the User-Agent header, and turn off the Chatterbox nodelet for any UAs that match a list of known spiders and robots. Since Google is our primary concern, that one should be easy to track down -- it's called Googlebot, I think, and the UA string should be fairly evident in the Apache logs.
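
    A quick Perl sketch of what that check might look like (render_chatterbox() and the exact bot patterns are made up on my part, not actual Everything Engine code):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical: skip the Chatterbox nodelet for known crawlers.
        my @bot_patterns = (
            qr/Googlebot/i,
            qr/Slurp/i,        # Inktomi
            qr/ia_archiver/i,  # Alexa / the Internet Archive
        );

        # Stand-in for however the Everything Engine actually emits the nodelet.
        sub render_chatterbox { print "<!-- chatterbox nodelet HTML -->\n" }

        my $ua = $ENV{HTTP_USER_AGENT} || '';
        my $is_bot = grep { $ua =~ $_ } @bot_patterns;

        render_chatterbox() unless $is_bot;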
      Watching the UA header, and even maintaining a list of known search engine IP addresses, is fairly common in the porn industry as a way to disguise the layout of a page that makes it to the top listing for particular keywords.

      It's generally used by people who don't want someone to come along and build a similar page that would have the same high ranking. For that goal, though, the UA header is useless, since it obviously can be spoofed.
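
      For comparison, a sketch of the IP-list variant (the addresses are placeholders from a documentation range, not real crawler netblocks):

          use strict;
          use warnings;

          # Hypothetical: decide by source IP instead of the spoofable UA header.
          # 192.0.2.x is a documentation range, not a real crawler netblock.
          my %known_spider_ip = map { $_ => 1 } qw(
              192.0.2.10
              192.0.2.11
          );
          my $addr = $ENV{REMOTE_ADDR} || '';
          print $known_spider_ip{$addr} ? "spider\n" : "human (probably)\n";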

      Anyway, some search engines despise this practice (called "cloaking"), since some people will get "XXX Hardcore Fuxor Pictures" listed on the keyword "Baseball" or something that equally undermines the effectiveness of the search engine. These engines (and I wish I had a current list) don't care if you have a legitimate reason for cloaking, and will ban you anyway.

      More rational search engines (a group I'm sure Google belongs in) will tolerate cloaking that doesn't impact how well they help people find sites that match their keywords.

      Check out IP Delivery, a poorly written Perl program for cloaking pages that sells for an absurd amount of money.
        This is not really a case of cloaking; it's an attempt to remove temporal data from a static cache. The only other ways to do that would be to remove the temporal data for non-registered clients, or to include a no-cache directive in robots.txt. Both are unacceptable: the former means AM can't see chat, and the latter makes all of PM non-cached. Since the primary "offender" is Google, I'd simply look at their UA string and serve a different page based on that. But I'm not an EE hacker, so I simply offer this as a "nice to have" to the development team.

        I have to wonder how well cloaking detection even works without human intervention... you can't simply compare the HTML from one GET to the next: the site could be using the UA to send tuned HTML, or could have a random feature, or any number of other things that result in slightly dissimilar HTML. As such, it would almost have to undergo human review, or some similarity testing that PM, with or without chatter, would probably pass.
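
        To make "similarity testing" concrete, here's a toy sketch of the kind of check I mean -- score the overlap between the word sets of two fetches of the same page; the 0.9 cutoff is arbitrary:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Toy similarity test: Dice coefficient over the word sets of two pages.
            sub similarity {
                my ($x, $y) = @_;
                my (%in_x, %in_y);
                $in_x{$_}++ for grep { length } split /\W+/, lc $x;
                $in_y{$_}++ for grep { length } split /\W+/, lc $y;
                my $common = grep { $in_y{$_} } keys %in_x;
                my $total  = keys(%in_x) + keys(%in_y);
                return $total ? 2 * $common / $total : 1;
            }

            # Two fetches of the "same" page, one with chatter and one without.
            my $with_chatter    = "<p>Perl is fun</p> <p>jack says hi in the chatterbox</p>";
            my $without_chatter = "<p>Perl is fun</p>";

            my $score = similarity($with_chatter, $without_chatter);
            printf "similarity %.2f: %s\n", $score,
                $score > 0.9 ? "probably the same page" : "noticeably different";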
Re: chatterbox & search engines
by chipmunk (Parson) on Oct 16, 2001 at 20:44 UTC
Re: chatterbox & search engines
by merlyn (Sage) on Oct 17, 2001 at 04:59 UTC