in reply to State-of-the-art in Harvester Blocking

The HTTP/WWW protocol is intended for information you mean to publicize. It allows for subscription-only sites, but its statelessness is not friendly to such abstract notions of who's welcome to look.

If you want to include random public customers and exclude competitors, you'd better figure out how to tell the difference. I surely don't know how to do that offhand, and I don't think it will be easy.

After Compline,
Zaxo


Re: Re: State-of-the-art in Harvester Blocking
by sgifford (Prior) on Nov 23, 2003 at 08:41 UTC

    The difference is that somebody who's harvesting listings will be getting much more data than a legitimate user. So the problem reduces to keeping track of whether a series of requests has come from the same user or not.

    The classic way to track a single user is with a cookie or session ID. One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message saying "Hang on while I register your computer with the system," and we simply sleep(60) and then set the cookie. After that, the cookie tracks how many searches they've done that day, and once the number gets too high, we start sleeping for longer and longer, essentially tarpitting for Web browsing (a rough sketch follows below).

    That would mean that either the user cooperates with the cookie and we can limit how many listings they can retrieve in a day, or else they don't and can only get one listing per minute. With enough listings, the data would be stale before everything was retrieved.

    The inconvenience wouldn't be too bad, since normal operation would involve only a single 60-second wait, ever, per computer.

    Of course, it would be terrible if cookies were disabled or blocked.
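
    A rough sketch of that idea in CGI-flavoured Perl, assuming a hypothetical cookie name and threshold; a real site would keep the per-session counts in a database or session store rather than the stand-in hash used here:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;
        use CGI::Cookie;

        my $q       = CGI->new;
        my %cookies = CGI::Cookie->fetch;

        unless ( $cookies{'hb_session'} ) {
            # No cookie yet: "register" this computer with a one-time delay,
            # then hand out a session cookie.
            sleep 60;
            my $c = CGI::Cookie->new(
                -name    => 'hb_session',
                -value   => time() . $$,    # crude unique ID, for illustration only
                -expires => '+1d',
            );
            print $q->header( -cookie => $c );
            print "Your computer is now registered; please repeat your search.\n";
            exit;
        }

        # Cookie present: count this session's searches and tarpit heavy users.
        my %count;                           # stand-in for a persistent store keyed by cookie value
        my $id = $cookies{'hb_session'}->value;
        my $n  = ++$count{$id};
        sleep( $n - 50 ) if $n > 50;         # hypothetical threshold; the wait grows with volume

        print $q->header;
        print "...search results...\n";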

      What prevents them from simultaneously acquiring a dozen session IDs each from a different IP (via an anonymous proxy or something of the sort) though? Implementing this will complicate your code without particularly hindering the harvester.

      Makeshifts last the longest.
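
      A hypothetical illustration of that concern: a harvester holding several independent sessions, each with its own HTTP::Cookies jar, and spreading requests across them so no single session trips the daily limit (the target URL is made up; routing each agent through a different proxy is omitted for brevity):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTTP::Cookies;

          # Twelve independent "users", each with its own cookie jar and session ID.
          my @sessions = map {
              LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new )
          } 1 .. 12;

          my $i = 0;
          for my $page ( 1 .. 1000 ) {
              my $ua   = $sessions[ $i++ % @sessions ];    # rotate through the sessions
              my $resp = $ua->get("http://example.com/listings?page=$page");
              # ... parse and store $resp->decoded_content ...
          }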

        In practice, very few harvesters are savvy enough to figure out what's going on and determined enough to build an effectively coordinated system for rotating the IDs. The session ID approach will stop most people.

        Good point... I hadn't thought about a harvester doing multiple sessions in parallel, because that's not how they've been doing it thus far.
      One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message...

      Would that actually prevent anything? First of all, you "punish" people who disable cookies in their browser (and I know quite a few people who do). Secondly, I don't know much about these harvester software packages, but even a simple Perl bot can use HTTP::Cookies, so I assume these programs can too :(

      --
      B10m
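
      For what it's worth, accepting cookies costs a bot almost nothing; a minimal single-session sketch with LWP::UserAgent and HTTP::Cookies (the URL is hypothetical):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use HTTP::Cookies;

          # An in-memory cookie jar is enough to accept and resend session cookies.
          my $ua = LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new );

          my $resp = $ua->get('http://example.com/listings?page=1');
          print $resp->decoded_content if $resp->is_success;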
        It will work against most harvesters: if they accept the cookie, you can track how many downloads they make in a day and limit them. In general, commercial harvesting programs are sophisticated enough to use cookies, but not sophisticated enough to rotate between dozens of sessions to avoid being limited.