Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a users-online script that records users' IP addresses, and I have an image gallery site. Lately I've had a crapload of kids with bots scraping my site and taking all my images, sucking the bandwidth out from under me.

I suppose I could block IP ranges every 5 seconds, but the problem is I don't know much about IPs. I know they can be altered, but here's the thing: I looked at my log while being scraped, and the first three sets (octets) of the IP are the same while the last one changes.

Does the last set determine if it's the same person or not? Or is it guaranteed to be the same person?

I.e.

64.12.116.67 64.12.116.202 64.12.116.201
These are the real IP addresses. I think this is the same person, but I'm not 100% sure. They are online at the same time.

So can I create a Perl script to filter out IPs that match on the first three sets, or does the last set matter too?

Sorry, this is more of a general technical question, but I need to figure something out before I use up all my bandwidth and my host beats me up.

Also, there's no need to reiterate that IP addresses can be spoofed and that there's no 100% reliable way to block them. I know that.

Re: blocking site scrapers
by mirod (Canon) on Feb 07, 2006 at 08:25 UTC
      If I am using a web host, could I install mod_throttle on my own directory? Or is that specifically for root?

        If mod_throttle is not already installed, then no, you can't use it. That's where you turn to the second link I gave above ;--)

      There's also mod_bwshare, though I haven't used it myself.
Re: blocking site scrapers
by chargrill (Parson) on Feb 07, 2006 at 04:33 UTC

    AOL user(s):

    $ whois 64.12.116.201
    OrgName:      America Online, Inc.
    OrgID:        AMERIC-158
    Address:      10600 Infantry Ridge Road
    City:         Manassas
    StateProv:    VA
    PostalCode:   20109
    Country:      US
    NetRange:     64.12.0.0 - 64.12.255.255
    CIDR:         64.12.0.0/16
    NetName:      AOL-MTC
    NetHandle:    NET-64-12-0-0-1
    Parent:       NET-64-0-0-0-0
    NetType:      Direct Assignment
    NameServer:   DNS-01.NS.AOL.COM
    NameServer:   DNS-02.NS.AOL.COM
    Comment:
    RegDate:      1999-12-13
    Updated:      1999-12-16
    RTechHandle:  AOL-NOC-ARIN
    RTechName:    America Online, Inc.
    RTechPhone:   +1-703-265-4670
    RTechEmail:   domains@aol.net

    But here's your problem: AOL users route through constantly rotating proxies, so that's PROBABLY the same user. HOWEVER, there's no guarantee that they'll come from 64.12.116.x the next time they decide to scrape your site.

    Given that they're likely coming from AOL, no, I doubt they can spoof their source IP.

    Do notice, however, that it's a /16 - you could technically block THAT entire range, but it might limit your audience more than you'd like.
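
    A minimal sketch of such a range check, assuming Net::Netmask is available (the /16 comes from the whois output above; the 403 response is just one possible policy):

        use strict;
        use warnings;
        use Net::Netmask;

        # The AOL proxy range from the whois output above.
        my $aol = Net::Netmask->new('64.12.0.0/16');

        my $ip = $ENV{REMOTE_ADDR} || '';
        if ($aol->match($ip)) {
            # Deny, delay, or just log -- whatever policy you settle on.
            print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\nSorry.\n";
            exit;
        }

    match() only tells you the address falls inside the range; whether you block, delay, or merely log it is a separate decision.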



    --chargrill
    $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );
      I'm not going to "block" anyone. My idea is to set up a script that will kill the request if the refresh is too quick. I'd record each IP in a database along with a timestamp of when it was last seen. If that IP tries to reload a page within X seconds, the rest of the page won't load for 5 seconds. This should cut back on bots and may even get them to stop, hopefully.

      But this would filter out search engine bots, too. So I'm stuck :(
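
      A rough sketch of that idea, assuming a CGI environment and using a tied DBM file as the "database" (the file name and thresholds are made up):

          use strict;
          use warnings;
          use Fcntl;
          use SDBM_File;

          my $min_interval = 3;   # seconds allowed between requests
          my $penalty      = 5;   # how long to stall a too-fast client

          tie my %last_seen, 'SDBM_File', '/tmp/last_seen', O_RDWR|O_CREAT, 0644
              or die "Can't tie DBM file: $!";

          my $ip  = $ENV{REMOTE_ADDR} || 'unknown';
          my $now = time;

          if (exists $last_seen{$ip} && $now - $last_seen{$ip} < $min_interval) {
              sleep $penalty;     # stall instead of blocking outright
          }
          $last_seen{$ip} = $now;
          untie %last_seen;

          # ... then serve the page or image as usual ...

      Note that sleep() in a CGI keeps a server process busy, so an aggressive bot still ties up resources; it just gets far fewer pages per minute.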

        Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)

        A well-behaved search engine bot SHOULD be discernible by its UA (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has requested, /robots.txt...

        Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...



        --chargrill
        $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

        You could start to build a second database (or add a field in the present one) that would include IP numbers that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.

        My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable: there's a bot out there that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. It's called WebVulnScan or WebVulnCrawl. That's just plain rude.

        But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?
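
        A rough sketch of that bookkeeping, built from the access log rather than live requests (the log path, log format, and UA list are all assumptions; adjust for your setup):

            use strict;
            use warnings;

            # Build a whitelist of IPs that either fetched /robots.txt or sent a
            # User-Agent matching a crawler we consider reputable.
            my @good_bots = qw(Googlebot msnbot Yahoo Teoma RufusBot netcraft MMCrawler);
            my $bot_re    = join '|', map { quotemeta } @good_bots;

            my %whitelist;
            open my $log, '<', '/var/log/apache/access.log' or die "open: $!";
            while (my $line = <$log>) {
                my ($ip) = $line =~ /^(\S+)/ or next;
                $whitelist{$ip} = 1 if $line =~ m{GET /robots\.txt} || $line =~ /$bot_re/o;
            }
            close $log;

            # Dump it somewhere the gallery script can read back cheaply.
            open my $out, '>', '/tmp/bot_whitelist.txt' or die "open: $!";
            print {$out} "$_\n" for sort keys %whitelist;
            close $out;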

        If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
        Blocking Greedy Clients
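
        For flavour, a minimal sketch of one way to do this under mod_perl 1 (my own simplification, not the book's code; a real setup needs a store shared across Apache children, such as a DBM file or shared memory, rather than a per-process hash):

            package My::SpeedLimit;
            use strict;
            use Apache::Constants qw(:common);

            my %last;   # per-child only -- use a shared store in real life

            sub handler {
                my $r   = shift;
                my $ip  = $r->connection->remote_ip;
                my $now = time;

                if (exists $last{$ip} && $now - $last{$ip} < 2) {
                    return FORBIDDEN;       # too fast, refuse this request
                }
                $last{$ip} = $now;
                return OK;
            }
            1;

        You would then point something like "PerlAccessHandler My::SpeedLimit" at the gallery's <Location> or <Directory> block in httpd.conf.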
Re: blocking site scrapers
by spiritway (Vicar) on Feb 07, 2006 at 03:07 UTC

    I've got the same problem, except that I also get attempted exploits (infected servers). It seems that your IP numbers all start out the same, and only the last octet varies. The chances are that the IP numbers are being pulled from a pool as needed, and then returned when not needed any more. This means that your miscreant might have different IP numbers each time (s)he signs on. Thus, I would block the whole range of IP numbers, 64.12.116.0 through 64.12.116.255. You run a slight risk of blocking out an innocent party, but it's probably worth it to save your bandwidth.
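
    If you take that route, a no-modules sketch of checking the first three octets against a blocklist (the prefixes here are only examples):

        use strict;
        use warnings;

        # /24 prefixes (first three octets) you have decided to block.
        my %blocked_prefix = map { $_ => 1 } ('64.12.116', '64.12.117');

        my $ip = $ENV{REMOTE_ADDR} || '';
        my ($prefix) = $ip =~ /^(\d+\.\d+\.\d+)\.\d+$/;

        if (defined $prefix && $blocked_prefix{$prefix}) {
            print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\nGo away.\n";
            exit;
        }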

    You may also want to use whois or traceroute (or tracert if you're using Windows) to find out whose server it is. You can get the owner's contact information from whois, and notify them of the problem. They may take steps to stop this from happening, depending on how ethical they are.

    As for spoofing IP numbers, it's not likely that someone would go to that much bother, just to steal some photos. And I'm wondering why they have to keep scraping, if they've already got them. Sounds kind of dumb to me - or else, they've written a very rude bot.

Re: blocking site scrapers
by DrHyde (Prior) on Feb 07, 2006 at 10:41 UTC
    The technique I talked about in Re: Re: Password hacker killer is probably going to be useful. I imagine you'll also need to have things decay out of your block list. Or rather, as you don't want to inconvenience real users too much, just skr1pt kiddies, I'd make it a delay list - if someone's in the list, make their downloads sloooooow.

    As someone else has mentioned, those particular addresses are AOL proxies, which indicates that you may want to score whole address ranges instead of individual addresses. This snippet from one of the scripts I use when hunting spammers will help.

    use Net::DNS;

    my $IP = '64.12.116.67';
    my ($ASN, $network, $network_bits) = @{
        Net::DNS::Resolver->new()
            ->query( join('.', reverse(split(/\./, $IP))) . ".asn.routeviews.org", "TXT", "IN" )
            ->{answer}->[0]->{char_str_list}
    };
    print "$network/$network_bits\n";
Re: blocking site scrapers
by pboin (Deacon) on Feb 07, 2006 at 14:07 UTC

    There are a lot of things to think about here, as other monks have well noted. One thing I could add for you to think about: you could also inadvertently block a router that's doing NAT for an entire organization. Everyone behind that router would appear to come from the same IP address in your logs. You may end up deciding that blocking a whole organization is OK, but at least consider what you're dealing with.

    One of the more clever ways to stop robots, IMO, is to have a tarpit link or picture that triggers a penalty period. Bots are dumb, and they'll fall for it every time, unless a human codes around your particular tarpit.

    My favorite example is the tarpit for SQLite on their wiki.
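
    A tarpit can be as simple as a link real users never see (hidden via CSS and disallowed in robots.txt) pointing at a script like the following sketch; the paths are made up:

        #!/usr/bin/perl
        # trap.cgi -- anything that follows the hidden link ends up here.
        use strict;
        use warnings;

        my $ip = $ENV{REMOTE_ADDR} || 'unknown';

        # Record the offender; the gallery script can consult this file
        # and slow down or refuse anyone listed in it.
        open my $fh, '>>', '/tmp/tarpit_offenders.log' or die "open: $!";
        print {$fh} join("\t", scalar localtime, $ip, $ENV{HTTP_USER_AGENT} || ''), "\n";
        close $fh;

        # Stall the bot a little while we're at it.
        sleep 20;
        print "Content-type: text/html\r\n\r\n<html><body>Nothing to see here.</body></html>\n";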

Re: blocking site scrapers
by Cody Pendant (Prior) on Feb 07, 2006 at 04:27 UTC
Re: blocking site scrapers
by monarch (Priest) on Feb 07, 2006 at 06:22 UTC
    Seeing as it's a photo site, would adding a small CAPTCHA image at login be a problem?
Re: blocking site scrapers
by lima1 (Curate) on Feb 07, 2006 at 19:19 UTC
    Maybe you could run some tests of your own if you can't use the Apache modules mentioned above (a fuller sketch follows below):

    $points = 0;

    What's the referer? Did the user come from your own homepage? Good sign: $points++

    Is the user agent something you know (Internet Exploder, Gecko, Opera, Google)? $points++

    Does the user agent support JavaScript (most spambots don't)? $points++

    Has this IP requested no page in the last X seconds? $points++

    if ($points > 1) { show_gallery(); }
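
    Pulled together as a single untested CGI fragment (the domain, thresholds, patterns, and the two helper subs are placeholders):

        use strict;
        use warnings;

        my $points  = 0;
        my $referer = $ENV{HTTP_REFERER}    || '';
        my $ua      = $ENV{HTTP_USER_AGENT} || '';
        my $ip      = $ENV{REMOTE_ADDR}     || '';

        # Came here from one of our own pages?  Good sign.
        $points++ if $referer =~ m{^https?://(www\.)?example\.com}i;   # your domain here

        # A user agent we recognise (browser or known search bot)?
        $points++ if $ua =~ /MSIE|Gecko|Opera|Googlebot/i;

        # JavaScript support has to be probed on an earlier page (e.g. a
        # JS-set cookie) and passed along -- assumed here as a cookie.
        $points++ if ($ENV{HTTP_COOKIE} || '') =~ /\bjs=1\b/;

        # No request from this IP in the last X seconds?
        $points++ if not_recently_seen($ip);

        if ($points > 1) {
            show_gallery();
        } else {
            print "Status: 403 Forbidden\r\n\r\n";
        }

        # Placeholder: wire this up to the timestamp store discussed earlier.
        sub not_recently_seen { return 1 }

        # Placeholder for the real gallery code.
        sub show_gallery { print "Content-type: text/html\r\n\r\n<html>gallery goes here</html>\n" }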
Re: blocking site scrapers
by Xenograg (Scribe) on Feb 07, 2006 at 17:40 UTC
    If your site is served by Apache, use mod_rewrite directly to filter on the referrer. I use it on my personal sites to block access to images except for requests from my own domains.

    --- The harder I work, the luckier I get.

      Nice idea, but referrers can be forged as well.

      BMaximus

        Yes, they can -- but if someone's scraping the site, they'd have been referred by the site in question to get to the image.

        Checking HTTP_REFERER is for those cases when someone from another website decides to link directly to an image (and/or page) on your site. Back in the early days of HTTP (i.e., 0.9, before there was such a thing as HTTP_REFERER), it was common for people to link to our imagemap and counter CGIs that ran on the server that I maintained -- they didn't care, and there was no real way to stop them.

        Likewise, people would find an image they liked (a bullet, some animated gif, whatever), and would link directly to it, sucking down your bandwidth. (the university where I worked only had a T1 in 1994)

        These days, however, when people check HTTP_REFERER, it's not to stop bots -- it's to stop people from linking directly to the images, so that other people visiting their site use someone else's bandwidth. As they don't have control over the other people's browsers, checking HTTP_REFERER can be a very effective way to cut down on abuse -- however, as not all browsers send HTTP_REFERER, you have to make sure that the null case is to allow the download.
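
        In a plain CGI, that null-case rule looks roughly like this (substitute your own domain for the example one):

            use strict;
            use warnings;

            my $referer = $ENV{HTTP_REFERER} || '';

            # Allow an empty referer (many browsers and proxies strip it) and
            # anything referred from our own pages; refuse the rest.
            my $ok = $referer eq ''
                  || $referer =~ m{^https?://(www\.)?example\.com/}i;   # your domain here

            unless ($ok) {
                print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\n",
                      "Please link to the page, not directly to the image.\n";
                exit;
            }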

        ...

        I'm also surprised that no one has mentioned checking X_FORWARDED_FOR to detect proxies (which would have identified the issue with AOL, as well as Squid and quite a few other proxies). There were also some proposals floating about for changing the robot exclusion standards to specify rate limiting and visiting hours, but it's been a decade and I've never seen any widespread support for them.
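
        Reading that header from a CGI is easy enough; just remember it is client-supplied and may carry a comma-separated chain of addresses:

            use strict;
            use warnings;

            # X-Forwarded-For may be absent, forged, or a chain like
            # "client, proxy1, proxy2" -- treat it as a hint, not proof.
            my $xff    = $ENV{HTTP_X_FORWARDED_FOR} || '';
            my $remote = $ENV{REMOTE_ADDR}          || '';
            my @chain  = grep { length } map { my $a = $_; $a =~ s/^\s+|\s+$//g; $a } split /,/, $xff;
            my $claimed_client = @chain ? $chain[0] : $remote;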

Re: blocking site scrapers
by mikeraz (Friar) on Feb 08, 2006 at 22:31 UTC

    If you own your server and have full access to it …
    You can insert firewall rules to block access from the offending IP for a set time.

    Inspired by a presentation on spam handling by merlyn, I wrote up a program that monitors log files for unwanted activity and locks out an IP for five minutes when offending activity is detected. This alone cut the number of spams delivered to my system for processing from ~30,000 a day to fewer than 1,000.

    It continues to be a great learning exercise that I'll hopefully polish into something real eventually.
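
    The skeleton of that kind of watcher, assuming a Linux box with iptables and a log you can tail (the path, thresholds, and timings are all made up):

        #!/usr/bin/perl
        # Watch an access log and temporarily firewall IPs that hammer the site.
        use strict;
        use warnings;

        my $log       = '/var/log/apache/access.log';   # adjust to your setup
        my $max_hits  = 60;      # requests allowed ...
        my $window    = 60;      # ... per this many seconds
        my $block_for = 300;     # how long to keep the firewall rule

        my (%hits, %blocked);    # ip => [timestamps], ip => time blocked

        open my $fh, '<', $log or die "Can't open $log: $!";
        seek $fh, 0, 2;          # skip history, start tailing at the end

        while (1) {
            while (my $line = <$fh>) {
                my ($ip) = $line =~ /^(\d+\.\d+\.\d+\.\d+)\s/ or next;
                next if $blocked{$ip};
                push @{ $hits{$ip} }, time;
                @{ $hits{$ip} } = grep { time - $_ <= $window } @{ $hits{$ip} };
                if (@{ $hits{$ip} } > $max_hits) {
                    system 'iptables', '-I', 'INPUT', '-s', $ip, '-j', 'DROP';
                    $blocked{$ip} = time;
                }
            }
            for my $ip (keys %blocked) {           # expire old blocks
                next if time - $blocked{$ip} < $block_for;
                system 'iptables', '-D', 'INPUT', '-s', $ip, '-j', 'DROP';
                delete $blocked{$ip};
            }
            sleep 5;
            seek $fh, 0, 1;                        # clear the EOF flag, keep tailing
        }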

    Be Appropriate && Follow Your Curiosity