I think limiting the number of displayed results is the most promising path here. I'd do two things: pick a (probably weighted) random subset of the results to display, and do this regardless of how the information was requested. That may demand a little tolerance from regular customers, but it shouldn't actually hinder regular business, while being a huge pain in the buttocks for a spider. Also, somebody refreshing the same results page more than thrice or so to get at the rest of the results gives themselves away as a spider.
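
Here's a rough Perl sketch of the first part, just to make the idea concrete. It assumes each result row carries some relevance score to weight by; the field names and the cutoff of ten displayed rows are made up, not anything from your code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pick a weighted random subset of results to display.
    sub weighted_subset {
        my ($results, $want) = @_;
        my @pool = @$results;
        my @picked;
        while (@pool && @picked < $want) {
            my $total = 0;
            $total += $_->{score} for @pool;
            my $roll = rand $total;
            my $i    = 0;
            for (; $i < @pool; $i++) {
                $roll -= $pool[$i]{score};
                last if $roll <= 0;
            }
            $i = $#pool if $i > $#pool;          # guard against float rounding
            push @picked, splice @pool, $i, 1;   # draw without replacement
        }
        return \@picked;
    }

    # Always trim the list, no matter how the search was requested.
    my @results = map { { id => $_, score => 1 + rand 10 } } 1 .. 50;
    my $shown   = weighted_subset(\@results, 10);
    print "showing result $_->{id}\n" for @$shown;

Two identical searches will overlap but rarely match exactly, so a harvester has to refresh many times to be reasonably sure it has seen everything, and that refreshing is exactly what gives it away.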
Another idea requires you to paginate your results and store the search parameters in a table keyed by unique IDs. These IDs would be used in your "next page" / "last page" links instead of having the parameters right inside the link. You would then scatter a random number of invisible links (modulo text mode browsers, unfortunately) with invalid search IDs around the "next page" / "last page" links. Someone who keeps stumbling into these blind links is obviously a spider.
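
A minimal sketch of that scheme, again in Perl. The in-memory %searches hash stands in for whatever real store (database table, session file) you'd use, and the names are all hypothetical.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %searches;   # stand-in for a persistent search-parameter table

    # File the parameters under an opaque ID and hand back the ID.
    sub store_search {
        my (%params) = @_;
        my $id = md5_hex( rand() . time() . join '|', %params );
        $searches{$id} = \%params;
        return $id;
    }

    # Build the paging links, surrounded by a few invisible decoys
    # whose IDs were never handed out.
    sub page_links {
        my ($id, $page) = @_;
        my @html;
        for ( 1 .. 1 + int rand 3 ) {
            my $trap = md5_hex( rand() . $$ );
            push @html, qq{<a href="/search?sid=$trap;page=1" style="display:none"></a>};
        }
        push @html, qq{<a href="/search?sid=$id;page=@{[ $page + 1 ]}">next page</a>};
        return join "\n", @html;
    }

    # Unknown IDs can only come from a decoy link, so flag the client.
    sub lookup_search {
        my ($id) = @_;
        return $searches{$id} if exists $searches{$id};
        my $who = $ENV{REMOTE_ADDR} || 'unknown client';
        warn "trap hit: unknown search ID $id from $who\n";
        return;
    }

Since the real parameters never appear in the URL, a spider can't synthesize its own links either; all it can do is follow what you give it, decoys included.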
The second idea isn't completely airtight, but it is still very difficult to circumvent. If you combine the two, the people running the spiders will have to pay someone a lot of money to write a spider that can harvest your site without getting caught in the traps. Even then you could still check your logs manually.
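
For the manual log check, something as dumb as this goes a long way. It assumes trap hits get logged one per line with the client IP as the first field; the log path is made up, and the warns in the sketch above would actually land in the web server's error log, so adapt the matching to wherever your trap hits really end up.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $trap_log = '/var/log/myapp/trap.log';   # hypothetical location
    open my $fh, '<', $trap_log or die "can't read $trap_log: $!";

    # Tally trap hits per client IP.
    my %hits;
    while (my $line = <$fh>) {
        my ($ip) = split ' ', $line;
        $hits{$ip}++ if defined $ip;
    }
    close $fh;

    # The IPs that keep stumbling into blind links float to the top.
    for my $ip ( sort { $hits{$b} <=> $hits{$a} } keys %hits ) {
        printf "%-15s %d trap hits\n", $ip, $hits{$ip};
    }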
Makeshifts last the longest.