in reply to blocking site scrapers

AOL user(s):

$ whois 64.12.116.201
OrgName:    America Online, Inc.
OrgID:      AMERIC-158
Address:    10600 Infantry Ridge Road
City:       Manassas
StateProv:  VA
PostalCode: 20109
Country:    US
NetRange:   64.12.0.0 - 64.12.255.255
CIDR:       64.12.0.0/16
NetName:    AOL-MTC
NetHandle:  NET-64-12-0-0-1
Parent:     NET-64-0-0-0-0
NetType:    Direct Assignment
NameServer: DNS-01.NS.AOL.COM
NameServer: DNS-02.NS.AOL.COM
Comment:
RegDate:    1999-12-13
Updated:    1999-12-16
RTechHandle: AOL-NOC-ARIN
RTechName:   America Online, Inc.
RTechPhone:  +1-703-265-4670
RTechEmail:  domains@aol.net

But here's your problem: AOL users route through constantly rotating proxies, so that's PROBABLY the same user. HOWEVER, there's no guarantee they'll come from 64.12.116.x the next time they decide to scrape your site.

Given that they're likely coming from AOL, no - I doubt they can spoof their source IP.

Do notice, however, that it's a /16 - you could technically block THAT entire range, but that might limit your audience more than you'd like.
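If you did decide to single out that range, here's a quick sketch of testing a client address against 64.12.0.0/16. It assumes NetAddr::IP from CPAN is available; nothing here is specific to your setup:

use strict;
use warnings;
use NetAddr::IP;    # CPAN module, assumed installed

my $aol = NetAddr::IP->new('64.12.0.0/16');

# in a CGI the client address is in $ENV{REMOTE_ADDR}
my $client = NetAddr::IP->new( $ENV{REMOTE_ADDR} || '64.12.116.201' );

if ( $aol->contains($client) ) {
    # treat as AOL traffic: throttle, challenge, or block as you see fit
}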



--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

Replies are listed 'Best First'.
Re^2: blocking site scrapers
by Anonymous Monk on Feb 07, 2006 at 04:43 UTC
    I'm not going to "block" anyone. My idea is to set up a script that throttles a client if it refreshes too quickly. I'd record their IP in a database along with a timestamp of when they were last seen. If they try to reload a page within X seconds, the rest of the page won't load for 5 seconds (rough sketch below). Hopefully this will cut back on bots and may even get them to stop.

    But this would filter search engine bots, too. So I'm stuck :(
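    Roughly what I have in mind - a sketch only, where the database handle, table and column names are placeholders:

    use strict;
    use warnings;
    use DBI;

    # Sketch: assumes a MySQL table hits(ip, last_seen), where last_seen
    # is a Unix timestamp. Connection details are placeholders.
    my $dbh = DBI->connect( 'dbi:mysql:database=site', 'user', 'password',
        { RaiseError => 1 } );

    my $ip           = $ENV{REMOTE_ADDR};
    my $now          = time;
    my $min_interval = 3;      # seconds a client must wait between page loads

    my ($last_seen) = $dbh->selectrow_array(
        'SELECT last_seen FROM hits WHERE ip = ?', undef, $ip );

    # too fast? stall this request for 5 seconds before serving the page
    sleep 5 if defined $last_seen && $now - $last_seen < $min_interval;

    # record (or refresh) when we last saw this client
    $dbh->do( 'REPLACE INTO hits (ip, last_seen) VALUES (?, ?)', undef, $ip, $now );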

      Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)

      A well-behaved search engine bot SHOULD be discernible by its UA (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has ever requested, /robots.txt...

      Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...
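      One way to do the robots.txt bookkeeping is simply to mine the server's access log for clients that have ever fetched it, and treat those IPs as probable well-behaved crawlers. A rough sketch, assuming an Apache-style combined log at a made-up path:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $log = '/var/log/apache/access.log';    # placeholder path

      my %asked_for_robots;

      open my $fh, '<', $log or die "Can't open $log: $!";
      while (<$fh>) {
          # in the combined format the client IP is the first field
          my ($ip) = split ' ', $_, 2;
          $asked_for_robots{$ip} = 1 if m{"(?:GET|HEAD) /robots\.txt[ ?]};
      }
      close $fh;

      # these are the clients that at least pretend to be polite
      print "$_\n" for sort keys %asked_for_robots;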



      --chargrill
      $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

        What's wrong with dynamically altering firewall rules? Before answering you should perhaps consider that firewalls can be used for tarpitting (i.e. slowing down connections to the point of unusability) or rate-limiting individual addresses or address ranges, as well as simple blocking. In fact, if you have to resort to an IP-based policy (generally a bad idea), a well-implemented firewall solution is usually a better idea than server-side request mangling.

        To answer the OP's question, if you're on Linux you may want to look at the "recent" iptables extension. This article provides an introduction on how to use it. If you're on a different OS, have a look at that OS's firewall documentation.
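        To give a flavour of it, here's a sketch of the usual "recent"-match pattern, wrapped in Perl only because that's what we speak here - you'd normally just put the iptables lines in your firewall script. The limits and the list name are illustrative, not a recommendation:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Drop any source that opens more than 10 new connections to port 80
        # within 60 seconds. Needs root, and assumes the "recent" match is available.
        my @rules = (
            'iptables -A INPUT -p tcp --dport 80 -m state --state NEW '
                . '-m recent --name HTTP --set',
            'iptables -A INPUT -p tcp --dport 80 -m state --state NEW '
                . '-m recent --name HTTP --update --seconds 60 --hitcount 10 -j DROP',
        );

        for my $rule (@rules) {
            system($rule) == 0 or warn "failed: $rule\n";
        }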


        All dogma is stupid.

      You could start building a second database (or add a field to the present one) that would include IP addresses that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.
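      A first cut at the "identified themselves as" part can be as crude as a regex over the User-Agent string - a sketch only, and the pattern clearly isn't exhaustive:

      use strict;
      use warnings;

      # crude whitelist of crawler User-Agent substrings; extend as needed
      my $known_bots = qr/Googlebot|SurveyBot|Yahoo|ysearch|sohu-search|msnbot
                         |RufusBot|netcraft|MMCrawler|Teoma|ConveraMultimediaCrawler/xi;

      my $ua = $ENV{HTTP_USER_AGENT} || '';

      if ( $ua =~ $known_bots ) {
          # looks reputable: flag this IP in the database and skip the throttling
      }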

      My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable. There's a bot out there that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. It's called WebVulnScan or WebVulnCrawl. That's just plain rude.

      But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?

        Hi.

        Seeing if they checked for a robots.txt file sounds like a great idea, but how would I know whether they did or not?

      If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
      Blocking Greedy Clients
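      The gist of that recipe is an access handler that counts requests per client IP over a time window and returns FORBIDDEN once a client gets greedy. This is not the book's code, just a stripped-down sketch of the idea for mod_perl 1 (it keeps its counts in a per-child hash, so the limit is only approximate on a preforking server):

      package My::SpeedLimit;

      use strict;
      use warnings;
      use Apache::Constants qw(OK FORBIDDEN);

      my %hits;                 # per-child, per-IP request timestamps
      my $WINDOW   = 60;        # seconds
      my $MAX_HITS = 30;        # allowed requests per window

      sub handler {
          my $r   = shift;
          my $ip  = $r->connection->remote_ip;
          my $now = time;

          # keep only the hits inside the window, then add this one
          my $list = $hits{$ip} ||= [];
          @$list = grep { $now - $_ < $WINDOW } @$list;
          push @$list, $now;

          return @$list > $MAX_HITS ? FORBIDDEN : OK;
      }

      1;

      Hook it up with "PerlAccessHandler My::SpeedLimit" inside a <Location> block in httpd.conf.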