in reply to State-of-the-art in Harvester Blocking

The HTTP/WWW protocol is intended for information you mean to publicize. It allows for subscription-only sites, but its statelessness is not friendly to such abstract notions of who's welcome to look.

If you want to include random public customers and exclude competitors, you'd better figure out how to tell the difference. I surely don't know how to do that offhand, and I don't think it will be easy.

After Compline,
Zaxo


Re: Re: State-of-the-art in Harvester Blocking
by sgifford (Prior) on Nov 23, 2003 at 08:41 UTC

    The difference is that somebody who's harvesting listings will be getting much more data than a legitimate user. So the problem reduces to keeping track of whether a series of requests has come from the same user or not.

    The classic way to track a single user is with a cookie or session ID. One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message saying "Hang on while I register your computer with the system," and we simply sleep(60) and then set the cookie. After that, the cookie tracks how many searches they've done that day, and once the number gets too high, we start sleeping for longer and longer, essentially tarpitting for Web browsing (a rough sketch follows below).

    That would mean that either the user cooperates with the cookie and we can limit how many listings they can retrieve in a day, or else they don't and can only get one listing per minute. With enough listings, the data would be stale before everything was retrieved.

    The inconvenience wouldn't be too bad, since normal operation would involve only a single 60-second wait, ever, per computer.

    Of course, it would be terrible if cookies were disabled or blocked.
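
    A rough sketch of that idea in CGI-flavoured Perl, assuming a hypothetical cookie name and threshold; a real site would keep the per-session counts in a database or session store rather than the stand-in hash used here:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;
        use CGI::Cookie;

        my $q       = CGI->new;
        my %cookies = CGI::Cookie->fetch;

        unless ( $cookies{'hb_session'} ) {
            # No cookie yet: "register" this computer with a one-time delay,
            # then hand out a session cookie.
            sleep 60;
            my $c = CGI::Cookie->new(
                -name    => 'hb_session',
                -value   => time() . $$,    # crude unique ID, for illustration only
                -expires => '+1d',
            );
            print $q->header( -cookie => $c );
            print "Your computer is now registered; please repeat your search.\n";
            exit;
        }

        # Cookie present: count this session's searches and tarpit heavy users.
        my %count;                           # stand-in for a persistent store keyed by cookie value
        my $id = $cookies{'hb_session'}->value;
        my $n  = ++$count{$id};
        sleep( $n - 50 ) if $n > 50;         # hypothetical threshold; the wait grows with volume

        print $q->header;
        print "...search results...\n";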

      What prevents them from simultaneously acquiring a dozen session IDs each from a different IP (via an anonymous proxy or something of the sort) though? Implementing this will complicate your code without particularly hindering the harvester.

      Makeshifts last the longest.
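
      A hypothetical illustration of that concern: a harvester holding several independent sessions, each with its own HTTP::Cookies jar, and spreading requests across them so no single session trips the daily limit (the target URL is made up; routing each agent through a different proxy is omitted for brevity):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTTP::Cookies;

          # Twelve independent "users", each with its own cookie jar and session ID.
          my @sessions = map {
              LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new )
          } 1 .. 12;

          my $i = 0;
          for my $page ( 1 .. 1000 ) {
              my $ua   = $sessions[ $i++ % @sessions ];    # rotate through the sessions
              my $resp = $ua->get("http://example.com/listings?page=$page");
              # ... parse and store $resp->decoded_content ...
          }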

        In practice, very few harvesters are savvy enough to figure out what's going on and determined enough to build an effectively coordinated system for rotating the IDs. The session ID approach will stop most people.

        Good point... I hadn't thought about a harvester doing multiple sessions in parallel, because that's not how they've been doing it thus far.
      One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message...

      Would that actually prevent anything? First of all, you "punish" people who disable cookies in their browser (and I know quite a few people who do). Secondly, I don't know much about these harvester software packages, but even a simple Perl bot can use HTTP::Cookies, so I assume these programs can too :(

      --
      B10m
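
      For what it's worth, accepting cookies costs a bot almost nothing; a minimal single-session sketch with LWP::UserAgent and HTTP::Cookies (the URL is hypothetical):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTTP::Cookies;

          # An in-memory cookie jar is enough to accept and resend session cookies.
          my $ua = LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new );

          my $resp = $ua->get('http://example.com/listings?page=1');
          print $resp->decoded_content if $resp->is_success;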
        It will work against most harvesters: if they accept the cookie, you can track how many downloads they make in a day and limit them. In general, commercial harvesting programs are sophisticated enough to use cookies, but not sophisticated enough to rotate between dozens of sessions to avoid being limited.