sgifford has asked for the wisdom of the Perl Monks concerning the following question:

I've got a Perl script which manages a medium-sized database of real estate listings for a local realtors' association. It's accessible to the public, but lately we've seen in our logs a few people going through the whole database sequentially. Presumably these people are harvesting the database.

The realtors' association would like to stop this. What are some of the better techniques for doing this, with minimal annoyance to real customers, and without making the site inaccessible to people with text-only browsers or visual impairments?

I've had some ideas already, but I'm hoping for something better...

I'm really just fishing for ideas, so if anybody has any thoughts I'd love to hear them. Thanks!

Replies are listed 'Best First'.
•Re: State-of-the-art in Harvester Blocking
by merlyn (Sage) on Nov 23, 2003 at 15:54 UTC
    If you publish the information, they can download it. That's the first rule of the net.

    If they must sign an AUP before downloading, you can terminate their access if they are discovered violating it.

    If the information is clearly copyrighted, and they republish it, you can sue them, and maybe even get them nailed on criminal charges.

    One technique for winning a court battle on that is to include a few "ringers" in the listings: a fake listing which looks like all the others, but doesn't derive from publicly available data. If you see a published list somewhere else that includes your ringer, there's your legal trigger.

    The term "ringer" comes from phone lists that were leased in the old days: a few of the phone numbers "ring" into the list-owners inner office, so the owners knew how many times a list was being used, and by whom. Every public white page and yellow page book has a few ringers in it, for example, just so the phone book publishers can see who and how the data is reused. I know of a few corporate phone books that had a few ringers as well: a number listed in the book rang into corporate security instead, and was always summarily dealt with.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      This works for network security as well. Take some scattered IPs from a large range that have never been used, route them to a box running Snort, and block anybody who hits them. Anyone touching those addresses is guaranteed to be randomly scanning your space and up to no good.

      I must say I like the idea of the ringer analogy. I do believe I've heard of it somewhere before, but it was obviously pushed to the extremes of my memory. Just wanted to say thanks for bringing that back into my mind. I'm sure it'll become a useful concept in some project of mine within the next few years or so :)

Re: State-of-the-art in Harvester Blocking
by dws (Chancellor) on Nov 23, 2003 at 07:52 UTC
    What are some of the better techniques for doing this, with minimal annoyance to real customers, and without making the site inaccessible to people with text-only browsers or visual impairments?

    Try some social pressure. Once you're reasonably certain that you've detected a spider, arrange to direct it to a page that lays out the terms of use for the site. Include some sort of spider-proof way to get unbanned (e.g., manually copy/paste this URL). The spider will keep getting the same page, until the spider's author/runner notices. At that point, they might think "oops, busted" or they might work around you. If the former, great. If the latter, you'll have to give some thought about whether you want to get into an arms race.
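
    As a rough sketch of that redirect, assuming per-IP request counts are already being tallied in a DBM file somewhere; the threshold, file path, and unblock URL below are all placeholders:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI qw(header);
        use DB_File;

        my $THRESHOLD = 200;                    # daily requests before we suspect a spider
        my %hits;
        tie %hits, 'DB_File', '/tmp/hits.db';   # placeholder counter store

        my $ip = $ENV{REMOTE_ADDR} || '0.0.0.0';
        $hits{$ip} = ($hits{$ip} || 0) + 1;

        if ($hits{$ip} > $THRESHOLD) {
            # Suspected spider: serve the terms of use instead of listings.
            print header(-status => '403 Forbidden'),
                  "<h1>Terms of use</h1>",
                  "<p>Automated harvesting of this site is prohibited.</p>",
                  "<p>If you are a person, copy and paste this address into your",
                  " browser to be unblocked: http://example.com/unblock?ip=$ip</p>";
            exit;
        }

        print header(), "<p>...normal listing output goes here...</p>";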

Re: State-of-the-art in Harvester Blocking
by Coruscate (Sexton) on Nov 23, 2003 at 07:02 UTC

    No matter what you do implementation-wise, I don't see too many options as to what you can do about this. You have the data available to the public, so how are you going to determine who's "real" and who's a program trying to get all your results?

    That would be like perlmonks creating a security layer that only allows a certain computer to access a set number of nodes per hour. Things just don't work like that. As far as going by IP address, I wouldn't even consider that. Too many proxies, dynamic IPs and such going around. Using an IP address to pinpoint a specific user just doesn't work as well as it might have in the past.

    The only method I can think of that would be scalable and programmatically simple would be to require authentication before permitting anyone to browse the listings. Then you could set limits on the number of listings viewable per week or something like that.
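
    A minimal sketch of such a cap, with an in-memory hash standing in for whatever counter store you would actually use, and a deliberately tiny made-up limit so the demo hits it:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %views_this_week;      # username => listings viewed so far (placeholder store)
        my $WEEKLY_LIMIT = 3;     # made-up policy number, tiny so the demo reaches the cap

        sub may_view_listing {
            my ($user) = @_;
            return 0 if ($views_this_week{$user} || 0) >= $WEEKLY_LIMIT;
            $views_this_week{$user}++;
            return 1;
        }

        for my $n (1 .. 5) {
            print may_view_listing('jdoe') ? "listing $n shown\n" : "weekly limit reached\n";
        }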

    But even that is circumventable. The person harvesting your database simply registers multiple usernames (as many as it takes) to do the job (the first user grabs the first 30 listings, the second grabs the next 30, etc.). So there is no real way that I can think of to solve this "problem" 100% effectively. The data is freely available. If your site dishes it out, there's no way to stop it from being programmatically stolen.

Re: State-of-the-art in Harvester Blocking
by davido (Cardinal) on Nov 23, 2003 at 07:04 UTC
    Rather than putting someone's email address in the clear on a webpage, you might take another approach:

    • Place a "Send email to Realtor John Doe" link that invokes a properly written mail form and CGI script, rather than actually giving out the email address. User could compose an email to the realtor requesting his email address personally, if needed..
    • Place a "Send me this realtor's contact info." link. The user would provide an email address to which the realtor's contact info (including email address) could be emailed.
    • Ask the end user to establish an account and log in, providing a verifiable email address, before gaining access to the database.
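
    A rough sketch of the relay script from the first bullet, assuming a lookup table keyed by realtor ID and a standard sendmail binary; all the names, addresses, and paths here are placeholders:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI qw(param header escapeHTML);

        # The visitor never sees the realtor's address; the script looks it up
        # and relays the message.
        my %realtor_email = ( 42 => 'jdoe@example-realty.com' );

        my $id   = param('realtor_id') || '';
        my $from = param('from')       || '';
        my $body = param('message')    || '';
        $from =~ s/[\r\n]//g;    # keep form input out of the mail headers

        print header();

        if (my $to = $realtor_email{$id}) {
            open my $mail, '|-', '/usr/sbin/sendmail -t' or die "sendmail: $!";
            print $mail "To: $to\n",
                        "From: webform\@example.com\n",
                        "Reply-To: $from\n",
                        "Subject: Inquiry about a listing\n\n",
                        "$body\n";
            close $mail or warn "sendmail exited with $?";
            print "<p>Your message has been sent to the realtor.</p>";
        } else {
            print "<p>Unknown realtor: ", escapeHTML($id), "</p>";
        }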

    As long as you put an email address in the clear (or even as a hidden field in a form), it can be harvested.

    Update: Whoops, we're not discussing email address harvesting, are we? Well, the principles I mentioned will also work for real-estate listings. Just keep the address confidential; require that the user send a message to get the address, etc. Require user log-in before addresses can be obtained. Authenticate new users by requiring valid email addresses, and so on.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: State-of-the-art in Harvester Blocking
by CountZero (Bishop) on Nov 23, 2003 at 09:45 UTC
    One could think of presenting the same information in different formats: every time the next page is presented, the layout is somewhat different. It can be as simple as switching the ZIP code and city around, or moving the contact address to the front and the link to the picture of the house to the back of the row, or ...

    This would certainly annoy any automatic harvesting of your pages.

    Or include "invisible" records (e.g. background and foreground color the same and very small point size) with good-looking but bogus information, which would then poison the harverster's database (although this will be bad for text-only browsers).

    If you do not have to cater for text-only browsers, one could think of providing the data in XML format with the tags given random names for this page only (and of course a different sequence of field tags within the record tags, with some unused field tags thrown in for good measure, e.g. two addresses and two phone numbers for each record, only one of which will be rendered), and making a "this page only" XSLT file which translates the data into HTML on the client side. Modern browsers will translate the XML into HTML on the fly, but it will take a fairly sophisticated harvester to make sense of it (or a lot of post-processing of the raw data).
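
    A sketch of the randomized-tag idea; the matching per-page XSLT would be generated from the same %tag mapping, which is omitted here, and the sample listing is made up:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use List::Util qw(shuffle);

        my @fields = qw(address city zip price contact);

        # Map each real field to a throwaway tag name for this page only
        # (a counter keeps the names unique even if rand() collides).
        my $i = 0;
        my %tag = map { $_ => sprintf 't%d%04d', $i++, int rand 10_000 } @fields;

        my @listings = (
            { address => '12 Oak St', city => 'Anytown', zip => '48103',
              price   => 219_000,     contact => '555-0100' },
        );

        print qq{<?xml version="1.0"?>\n<listings>\n};
        for my $l (@listings) {
            print "  <row>\n";
            for my $f (shuffle @fields) {          # different field order every page
                print "    <$tag{$f}>$l->{$f}</$tag{$f}>\n";
            }
            print "  </row>\n";
        }
        print "</listings>\n";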

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      One could think of presenting the same information in different formats: every time the next page is presented, the layout is somewhat different. It can be as simple as switching the ZIP code and city around, or moving the contact address to the front and the link to the picture of the house to the back of the row, or ... This would certainly annoy any automatic harvesting of your pages.

      It'd also likely annoy regular users of your site. Imagine if the PM voting buttons moved around (sometimes above a node, sometimes below, inconsistent order, etc), or if the nodelets' positions couldn't be guaranteed.
      Humans are pretty good at spotting differences in information if the information's laid out in a consistent manner. If you're comparing properties, those differences (price, number of bedrooms, city) will be all-important, so making them harder to spot will also make your site harder to use.

      Or include "invisible" records (e.g. background and foreground color the same and very small point size) with good-looking but bogus information, which would then poison the harverster's database (although this will be bad for text-only browsers).

      I'm nit-picking now, but not just text-only browsers. What about people using high-contrast colour schemes in their browsers? I genuinely believe the OP has a difficult task if they want to preserve their goal of accessibility.


      davis
      It's not easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.

        Invisible records could be included in more ways than just color. Perhaps some trickery with CSS, spans, div tags, etc. The rearranging of content could be more of an HTML thing as well. Some tags could be mixed and matched to make the page look the same in a browser but be parsed differently by the harvester.


        ___________
        Eric Hodges
Re: State-of-the-art in Harvester Blocking
by Zaxo (Archbishop) on Nov 23, 2003 at 07:44 UTC

    The HTTP/WWW protocol is intended for information you mean to publicize. It allows for subscription-only sites, but its statelessness is not friendly to such abstract notions of who is welcome to look.

    If you want to include random public customers and exclude competitors, you'd better figure out how to tell the difference. I surely don't know how to do that offhand, and I don't think it will be easy.

    After Compline,
    Zaxo

      The difference is that somebody who's harvesting listings will be getting much more data than a legitimate user. So the problem reduces to keeping track of whether a series of requests has come from the same user or not.

      The classic way to track a single user is with a cookie or session ID. One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message saying "Hang on while I register your computer with the system," and we just sleep(60), then set the cookie. After that, the cookie tracks how many searches they've done that day, and as the number gets too high, we start to sleep for longer and longer, essentially tarpitting for Web browsing.

      That would mean that either the user cooperates with the cookie and we can limit how many listings they can retrieve in a day, or else they don't and they can only get one listing per minute. With enough listings, the data would be stale before everything was retrieved.

      The inconvenience wouldn't be too bad, since in normal operation it would be a single 60-second wait, ever, for each computer.
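
      Here's a rough sketch of what I have in mind, using CGI's cookie support; the thresholds are pulled out of thin air:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use CGI qw(cookie header);

          $| = 1;
          my $searches = cookie('searches');

          if (!defined $searches) {
              # First visit: hand out the cookie, but make them wait once.
              print header(-cookie => cookie(-name    => 'searches',
                                             -value   => 0,
                                             -expires => '+1d')),
                    "<p>Hang on while I register your computer with the system...</p>";
              sleep 60;
              exit;
          }

          $searches++;
          sleep 5 * ($searches - 20) if $searches > 20;   # tarpit heavy users

          print header(-cookie => cookie(-name    => 'searches',
                                         -value   => $searches,
                                         -expires => '+1d')),
                "<p>...search results here...</p>";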

      Of course, it would be terrible if cookies were disabled or blocked.

        What prevents them from simultaneously acquiring a dozen session IDs, each from a different IP (via anonymous proxies or something of the sort)? Implementing this will complicate your code without particularly hindering the harvester.

        Makeshifts last the longest.

        One possibility would be to check for the presence of a cookie on the user's computer. If we don't find it, they get a message...

        Would that actually prevent anything? First of all, you "punish" people who disable cookies in their browser (and I know quite a lot of people who do). Secondly, I don't know much about these harvester software packages, but even a simple Perl bot can use HTTP::Cookies, so I assume these programs can too :(
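
        To illustrate how little the cookie check buys you, a cookie-aware fetcher is only a few lines of LWP; the URL below is a placeholder:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use LWP::UserAgent;
            use HTTP::Cookies;

            # Accepts and returns cookies just like a browser would.
            my $ua = LWP::UserAgent->new;
            $ua->cookie_jar(HTTP::Cookies->new(file => 'cookies.txt', autosave => 1));

            for my $page (1 .. 10) {
                my $res = $ua->get("http://example.com/listings?page=$page");
                print "page $page: ", $res->status_line, "\n";
            }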

        --
        B10m
Re: State-of-the-art in Harvester Blocking
by aquarium (Curate) on Nov 23, 2003 at 11:42 UTC
    I thought the first question that needs to be answered is "why block?" If you simply don't want competitors to easily suck in addresses and prices, then either don't list prices or render the prices as GIFs; then at least it will take manual labor to view the pages and write down the prices. You could potentially also be restricting legitimate (computer-savvy) buyers who do their homework. A fairly hard-to-work-around page would consist of a session, with the session being modified slightly each time a user comes to your page, and the session must follow the proper path to succeed. This, however, would be a pain for frequent users of your site, as things would look different each time they try to do the same thing (get some listings). Perhaps you alarmed the non-tech people in your company too much when you discovered harvesters in your logs. I certainly would not worry about it unless bandwidth was being adversely affected.
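
    Rendering a price as an image is only a few lines with GD; a minimal sketch (PNG rather than GIF, since GIF support depends on how GD was built, and the price is made up):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use GD;

        # Render a price as an image so it can't be scraped as plain text.
        my $price = '$219,000';

        my $img   = GD::Image->new(100, 20);
        my $white = $img->colorAllocate(255, 255, 255);   # first color = background
        my $black = $img->colorAllocate(0, 0, 0);
        $img->string(gdMediumBoldFont, 5, 3, $price, $black);

        binmode STDOUT;
        print "Content-type: image/png\n\n", $img->png;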

      Yeah, I'm not sure why the data needs to be blocked---I was just hired to do the coding, not set policy.

      Another reason for my concern is that we've looked at incorporating other databases, and to do that we have to agree to do something about harvesting:

      • An IDXP displaying the IDX Database or any portion thereof shall make reasonable efforts to avoid "scraping" of the data by third parties or displaying of that data on any other Web site. Reasonable efforts shall include but not be limited to:
        1. Monitoring the Web site for signs that a third party is "scraping" data and
        2. Prominently posting notice that "Any use of search facilities of data on the site, other than by a consumer looking to purchase real estate, is prohibited."
        This section places a burden on the Broker and the Broker's Web site host to monitor their site. If it appears that a large number of hits is coming from a particular domain on the Web and that these hits may be the result of an automated process designed to gather or scrape data from the Broker's Web site for use somewhere else for a commercial purpose, the Broker must notify (Agency).

      So, I'd have to do a lot of convincing before I'd be able to say, "Guys, just don't worry about it!".

      And yes, I realize I could do just the two things above and probably we'd be safe contractually, but if I agree to make an effort to stop harvesters, I'd like to make sure I'm doing my honest best.

Re: State-of-the-art in Harvester Blocking
by Ninthwave (Chaplain) on Nov 23, 2003 at 11:47 UTC

    Use the /. posting limitation: record the time of a request and refuse all requests from the same IP address for x time. Just make sure the "time limit not reached" error page displays some information for people behind proxies, or who may have used the Back button to get to an erroneous page.
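
    A sketch of that limiter, with a DBM file standing in for wherever you would really keep the timestamps and a made-up delay of 20 seconds:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DB_File;

        my $WAIT = 20;                                  # seconds between searches, per IP
        my %last;
        tie %last, 'DB_File', '/tmp/last_request.db';   # placeholder timestamp store

        my $ip  = $ENV{REMOTE_ADDR} || '0.0.0.0';
        my $now = time;

        if (exists $last{$ip} && $now - $last{$ip} < $WAIT) {
            my $left = $WAIT - ($now - $last{$ip});
            print "Content-type: text/html\n\n",
                  "<p>Please wait $left more second(s) before searching again. ",
                  "If you share an IP address with others (e.g. behind a proxy) or got ",
                  "here with the Back button, just wait a moment and retry.</p>";
            exit;
        }

        $last{$ip} = $now;
        print "Content-type: text/html\n\n<p>...search results here...</p>";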

    Not the best but makes data harvesting programs inefficient.

    "No matter where you go, there you are." BB
Re: State-of-the-art in Harvester Blocking
by delirium (Chaplain) on Nov 23, 2003 at 17:15 UTC
    lately we've seen in our logs a few people going through the whole database sequentially

    If you can track down who they are, sell them your database. Ultimately the companies interested in your data would be willing to pay money for it -- they're already paying someone to harvest it.

    You may be able to get more money from them than they pay their flunky, and there are no lawsuits to bother with.

      That's actually a fantastic idea, but unfortunately there's bureaucracy and a set of policies that are all but set in stone in the way of it.
Re: State-of-the-art in Harvester Blocking
by Aristotle (Chancellor) on Nov 23, 2003 at 13:09 UTC

    I think limiting the number of displayed results is the most promising path here. I'd do two things: pick a (probably weighted) random subset of results to display, and do this regardless of how the information was requested. This may require some pain threshold on the part of regular customers, but it shouldn't actually hinder regular business, while being a huge pain in the buttocks for a spider. Also, somebody refreshing the same results page more than thrice or so to get more of the results would give themselves away as a spider.

    Another idea would require you to paginate your results and look the search parameters up in a table keyed by unique IDs. These IDs would be used in your "next page" / "last page" links instead of having the parameters right inside the link. You would then put a random number of invisible links (modulo text-mode browsers, unfortunately) with invalid search IDs around the "next page" / "last page" links. Someone who keeps stumbling into these blind links is obviously a spider.
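
    A rough sketch of the blind-link trap; %search_by_id stands in for the table of issued search IDs, and the display:none styling is just one way to hide the decoys:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI qw(param);

        # Issued search IDs; any ID not in this table is a trap.
        my %search_by_id = ( 'a1b2c3' => { city => 'Anytown', max_price => 250_000 } );

        sub page_links {
            my ($real_id) = @_;
            my @links = (qq{<a href="/search?id=$real_id;page=2">next page</a>});
            # Sprinkle in invisible decoys pointing at IDs that were never handed out.
            for (1 .. 1 + int rand 3) {
                my $fake = sprintf '%06x', int rand 0xffffff;
                redo if exists $search_by_id{$fake};   # don't trap a real ID by accident
                push @links, qq{<a href="/search?id=$fake" style="display:none"></a>};
            }
            return join ' ', @links;
        }

        my $id = param('id') || '';
        if ($id && !exists $search_by_id{$id}) {
            my $who = $ENV{REMOTE_ADDR} || 'unknown';
            warn "possible spider from $who: bogus search id '$id'\n";
            # ...flag the IP, serve the terms-of-use page, etc.
        }

        print page_links('a1b2c3'), "\n";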

    The second idea isn't completely airtight, but it is very difficult to circumvent anyway. If you combine the two, the people using the spiders will have to pay someone a lot of money to write a spider that can harvest your site without getting caught by the traps. Even then you could still check the logs manually.

    Makeshifts last the longest.

Re: State-of-the-art in Harvester Blocking (javascript)
by zby (Vicar) on Nov 24, 2003 at 10:04 UTC
    I once did some harvesting (it was white hat, working around the bureaucracy in a really big company), and I tell you, the most annoying thing was JavaScript.