in reply to blocking site scrapers

AOL user(s):

$ whois 64.12.116.201
OrgName:    America Online, Inc.
OrgID:      AMERIC-158
Address:    10600 Infantry Ridge Road
City:       Manassas
StateProv:  VA
PostalCode: 20109
Country:    US
NetRange:   64.12.0.0 - 64.12.255.255
CIDR:       64.12.0.0/16
NetName:    AOL-MTC
NetHandle:  NET-64-12-0-0-1
Parent:     NET-64-0-0-0-0
NetType:    Direct Assignment
NameServer: DNS-01.NS.AOL.COM
NameServer: DNS-02.NS.AOL.COM
Comment:
RegDate:    1999-12-13
Updated:    1999-12-16
RTechHandle: AOL-NOC-ARIN
RTechName:   America Online, Inc.
RTechPhone:  +1-703-265-4670
RTechEmail:  domains@aol.net

But here's your problem: AOL users route through constantly rotating proxies, so that's PROBABLY the same user. HOWEVER, there's no guarantee they'll come from 64.12.116.x the next time they decide to scrape your site.

Given that they're likely coming from AOL, no - I doubt they can spoof their source IP.

Do notice, however, that it's a /16 - you could technically block THAT entire range, but that might limit your audience more than you'd like.
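If you did decide to single out that range, here's a quick sketch of testing a client address against 64.12.0.0/16. It assumes NetAddr::IP from CPAN is available; nothing here is specific to your setup:

use strict;
use warnings;
use NetAddr::IP;    # CPAN module, assumed installed

my $aol = NetAddr::IP->new('64.12.0.0/16');

# in a CGI the client address is in $ENV{REMOTE_ADDR}
my $client = NetAddr::IP->new( $ENV{REMOTE_ADDR} || '64.12.116.201' );

if ( $aol->contains($client) ) {
    # treat as AOL traffic: throttle, challenge, or block as you see fit
}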



--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

Replies are listed 'Best First'.
Re^2: blocking site scrapers
by Anonymous Monk on Feb 07, 2006 at 04:43 UTC
    I'm not going to "block" anyone. My idea is to set up a script that throttles a client if it refreshes too quickly. I'd record their IP in a database along with a timestamp of when they were last seen. If they try to reload a page within X seconds, the rest of the page won't load for 5 seconds (rough sketch below). Hopefully this will cut back on bots and may even get them to stop.

    But this would filter search engine bots, too. So I'm stuck :(
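    Roughly what I have in mind - a sketch only, where the database handle, table and column names are placeholders:

    use strict;
    use warnings;
    use DBI;

    # Sketch: assumes a MySQL table hits(ip, last_seen), where last_seen
    # is a Unix timestamp. Connection details are placeholders.
    my $dbh = DBI->connect( 'dbi:mysql:database=site', 'user', 'password',
        { RaiseError => 1 } );

    my $ip           = $ENV{REMOTE_ADDR};
    my $now          = time;
    my $min_interval = 3;      # seconds a client must wait between page loads

    my ($last_seen) = $dbh->selectrow_array(
        'SELECT last_seen FROM hits WHERE ip = ?', undef, $ip );

    # too fast? stall this request for 5 seconds before serving the page
    sleep 5 if defined $last_seen && $now - $last_seen < $min_interval;

    # record (or refresh) when we last saw this client
    $dbh->do( 'REPLACE INTO hits (ip, last_seen) VALUES (?, ?)', undef, $ip, $now );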

      Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)

      A well-behaved search engine bot SHOULD be discernible by its UA (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has ever requested, /robots.txt...

      Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...
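      One way to do the robots.txt bookkeeping is simply to mine the server's access log for clients that have ever fetched it, and treat those IPs as probable well-behaved crawlers. A rough sketch, assuming an Apache-style combined log at a made-up path:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $log = '/var/log/apache/access.log';    # placeholder path

      my %asked_for_robots;

      open my $fh, '<', $log or die "Can't open $log: $!";
      while (<$fh>) {
          # in the combined format the client IP is the first field
          my ($ip) = split ' ', $_, 2;
          $asked_for_robots{$ip} = 1 if m{"(?:GET|HEAD) /robots\.txt[ ?]};
      }
      close $fh;

      # these are the clients that at least pretend to be polite
      print "$_\n" for sort keys %asked_for_robots;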



      --chargrill
      $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

        What's wrong with dynamically altering firewall rules? Before answering you should perhaps consider that firewalls can be used for tarpitting (i.e. slowing down connections to the point of unusability) or rate-limiting individual addresses or address ranges, as well as simple blocking. In fact, if you have to resort to an IP-based policy (generally a bad idea), a well-implemented firewall solution is usually a better idea than server-side request mangling.

        To answer the OP's question, if you're on Linux you may want to look at the "recent" iptables extension. This article provides an introduction on how to use it. If you're on a different OS, have a look at that OS's firewall documentation.
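        To give a flavour of it, here's a sketch of the usual "recent"-match pattern, wrapped in Perl only because that's what we speak here - you'd normally just put the iptables lines in your firewall script. The limits and the list name are illustrative, not a recommendation:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Drop any source that opens more than 10 new connections to port 80
        # within 60 seconds. Needs root, and assumes the "recent" match is available.
        my @rules = (
            'iptables -A INPUT -p tcp --dport 80 -m state --state NEW '
                . '-m recent --name HTTP --set',
            'iptables -A INPUT -p tcp --dport 80 -m state --state NEW '
                . '-m recent --name HTTP --update --seconds 60 --hitcount 10 -j DROP',
        );

        for my $rule (@rules) {
            system($rule) == 0 or warn "failed: $rule\n";
        }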


        All dogma is stupid.

      You could start building a second database (or add a field to the present one) that would include IP addresses that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.
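      A first cut at the "identified themselves as" part can be as crude as a regex over the User-Agent string - a sketch only, and the pattern clearly isn't exhaustive:

      use strict;
      use warnings;

      # crude whitelist of crawler User-Agent substrings; extend as needed
      my $known_bots = qr/Googlebot|SurveyBot|Yahoo|ysearch|sohu-search|msnbot
                         |RufusBot|netcraft|MMCrawler|Teoma|ConveraMultimediaCrawler/xi;

      my $ua = $ENV{HTTP_USER_AGENT} || '';

      if ( $ua =~ $known_bots ) {
          # looks reputable: flag this IP in the database and skip the throttling
      }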

      My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable. There's a bot out there that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. It's called WebVulnScan or WebVulnCrawl. That's just plain rude.

      But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?

        Hi.

        Seeing if they checked for a robots.txt file sounds like a great idea, but how would I know whether they did or not?

      If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
      Blocking Greedy Clients
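      The gist of that recipe is an access handler that counts requests per client IP over a time window and returns FORBIDDEN once a client gets greedy. This is not the book's code, just a stripped-down sketch of the idea for mod_perl 1 (it keeps its counts in a per-child hash, so the limit is only approximate on a preforking server):

      package My::SpeedLimit;

      use strict;
      use warnings;
      use Apache::Constants qw(OK FORBIDDEN);

      my %hits;                 # per-child, per-IP request timestamps
      my $WINDOW   = 60;        # seconds
      my $MAX_HITS = 30;        # allowed requests per window

      sub handler {
          my $r   = shift;
          my $ip  = $r->connection->remote_ip;
          my $now = time;

          # keep only the hits inside the window, then add this one
          my $list = $hits{$ip} ||= [];
          @$list = grep { $now - $_ < $WINDOW } @$list;
          push @$list, $now;

          return @$list > $MAX_HITS ? FORBIDDEN : OK;
      }

      1;

      Hook it up with "PerlAccessHandler My::SpeedLimit" inside a <Location> block in httpd.conf.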