Re: blocking site scrapers
by mirod (Canon) on Feb 07, 2006 at 08:25 UTC
If I am using a web host, could I install mod_throttle on my own directory? Or is that specifically for root?
There's also mod_bwshare, though I haven't used it myself.
Re: blocking site scrapers
by chargrill (Parson) on Feb 07, 2006 at 04:33 UTC
$ whois 64.12.116.201
OrgName: America Online, Inc.
OrgID: AMERIC-158
Address: 10600 Infantry Ridge Road
City: Manassas
StateProv: VA
PostalCode: 20109
Country: US
NetRange: 64.12.0.0 - 64.12.255.255
CIDR: 64.12.0.0/16
NetName: AOL-MTC
NetHandle: NET-64-12-0-0-1
Parent: NET-64-0-0-0-0
NetType: Direct Assignment
NameServer: DNS-01.NS.AOL.COM
NameServer: DNS-02.NS.AOL.COM
Comment:
RegDate: 1999-12-13
Updated: 1999-12-16
RTechHandle: AOL-NOC-ARIN
RTechName: America Online, Inc.
RTechPhone: +1-703-265-4670
RTechEmail: domains@aol.net
But here's your problem: AOL users route through constantly rotating proxies, so those requests are probably all from the same user. However, there's no guarantee they'll come from 64.12.116.x the next time they decide to scrape your site.
Given that they're likely coming from AOL, no, I doubt they can spoof their source IP.
Do notice, however, that it's a /16, so you could technically block that entire range, but that might limit your audience more than you'd like.
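If you do decide to experiment with blocking that range, a CIDR match is easy to script. A minimal sketch, assuming Net::Netmask is installed (the IP is just the one from the whois output above):

    use Net::Netmask;

    # the AOL assignment from the whois output above
    my $aol = Net::Netmask->new('64.12.0.0/16');

    my $ip = '64.12.116.201';
    print "$ip is inside 64.12.0.0/16, so it would be blocked\n" if $aol->match($ip);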
--chargrill
Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)
A well-behaved search engine bot should be discernible by its UA (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has previously requested, /robots.txt...
Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without shutting out legitimate visitors...
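As a rough illustration of the robots.txt idea, here's a minimal sketch that pulls out the IPs that fetched /robots.txt from a common-format access log (the log path is an assumption; adjust for your setup):

    use strict;
    use warnings;

    my %polite;    # ip => number of times it asked for /robots.txt
    open my $log, '<', '/var/log/apache/access.log' or die "can't open log: $!";
    while (<$log>) {
        my ($ip, $request) = /^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)"/;
        next unless defined $request;
        $polite{$ip}++ if $request =~ m{^(?:GET|HEAD) /robots\.txt};
    }
    print "$_ requested robots.txt $polite{$_} time(s)\n" for sort keys %polite;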
--chargrill
You could start to build a second database (or add a field to the present one) that records IP numbers which requested robots.txt, or which identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, or whatever else seems to be reputable.
My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable: there's a bot out there, called WebVulnScan or WebVulnCrawl, that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. That's just plain rude.
But just a thought: if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?
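For the user-agent side of that, something as simple as a pattern built from the list above would do. A quick sketch (the helper name is just for illustration):

    # the user agents mentioned above; extend the list as you meet more
    my $reputable_ua = qr/Googlebot|SurveyBot|Yahoo!|ysearch|sohu-search|msnbot
                          |RufusBot|netcraft\.com|MMCrawler|Teoma|ConveraMultimediaCrawler/xi;

    sub looks_reputable {
        my ($user_agent) = @_;
        return defined $user_agent && $user_agent =~ $reputable_ua;
    }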
If you have mod_perl installed on your server, you could use the technique given in the mod_perl book: Blocking Greedy Clients.
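Not the book's code, but a rough sketch of the same idea: a PerlAccessHandler (mod_perl 1 API) that refuses clients making too many requests too quickly. It keeps per-child state only; a real setup would want storage shared across Apache children.

    package My::BlockGreedy;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my %hits;          # per-child counters: ip => { start => epoch, count => n }
    my $MAX    = 30;   # allow this many requests...
    my $WINDOW = 60;   # ...per this many seconds (tune to taste)

    sub handler {
        my $r   = shift;
        my $ip  = $r->connection->remote_ip;
        my $now = time;

        my $rec = $hits{$ip} ||= { start => $now, count => 0 };
        %$rec = ( start => $now, count => 0 ) if $now - $rec->{start} > $WINDOW;

        return FORBIDDEN if ++$rec->{count} > $MAX;
        return OK;
    }

    1;

    # httpd.conf:  PerlAccessHandler My::BlockGreedy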
Re: blocking site scrapers
by spiritway (Vicar) on Feb 07, 2006 at 03:07 UTC
I've got the same problem, except that I also get attempted exploits (infected servers). It seems that your IP numbers all start out the same, and only the last octet varies. The chances are that the IP numbers are being pulled from a pool as needed, and then returned when not needed any more. This means that your miscreant might have different IP numbers each time (s)he signs on. Thus, I would block the whole range of IP numbers, 64.12.116.0 through 64.12.116.255. You run a slight risk of blocking out an innocent party, but it's probably worth it to save your bandwidth.
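At the CGI level, a crude version of that block is just a prefix match on the /24 (the response text is up to you):

    my $ip = $ENV{REMOTE_ADDR} || '';
    if ($ip =~ /^64\.12\.116\./) {
        print "Status: 403 Forbidden\r\nContent-Type: text/plain\r\n\r\nSorry, this range is blocked.\n";
        exit;
    }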
You may also want to use whois or traceroute (or tracert if you're using Windows) to find out whose server it is. You can get the owner's contact information from whois, and notify them of the problem. They may take steps to stop this from happening, depending on how ethical they are.
As for spoofing IP numbers, it's not likely that someone would go to that much bother just to steal some photos. And I'm wondering why they have to keep scraping if they've already got the pictures. Sounds kind of dumb to me; either that, or they've written a very rude bot.
Re: blocking site scrapers
by DrHyde (Prior) on Feb 07, 2006 at 10:41 UTC
The technique I talked about in Re: Re: Password hacker killer is probably going to be useful. I imagine you'll also need to have entries decay out of your block list. Or rather, since you don't want to inconvenience real users too much, just skr1pt kiddies, I'd make it a delay list: if someone's in the list, make their downloads sloooooow.
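A rough sketch of that delay-list idea (the scoring itself and the decay are left out; the handler name is illustrative):

    my %score;    # ip (or network) => badness score, decremented elsewhere over time

    sub maybe_throttle {
        my ($ip) = @_;
        sleep $score{$ip} if $score{$ip};    # one extra second per point of badness
    }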
As someone else has mentioned, those particular addresses are AOL proxies, which indicates that you may want to score whole address ranges instead of individual addresses. This snippet from one of the scripts I use when hunting spammers will help.
use Net::DNS;

my $IP = '64.12.116.67';

# ask routeviews.org (via a DNS TXT lookup) which ASN and network announce this IP
my ($ASN, $network, $network_bits) = @{
    Net::DNS::Resolver->new()
        ->query(join('.', reverse(split(/\./, $IP))) . '.asn.routeviews.org', 'TXT', 'IN')
        ->{answer}->[0]->{char_str_list}
};
print "$network/$network_bits\n";
Re: blocking site scrapers
by pboin (Deacon) on Feb 07, 2006 at 14:07 UTC
There's a lot of things to think about here, as other monks have well-noted. One thing I could add for you to think about: You could also inadvertently block a router that's doing NAT for an entire organization. Everyone behind that router would appear to come from the same IP address in your logs. You may end up deciding that blocking a whole organization is OK, but at least consider what you're dealing with.
One of the more clever ways to stop robots, IMO, is to have a tarpit link or picture that triggers a penalty period. Bots are dumb, and they'll fall for it every time, unless a human codes around your particular tarpit.
My favorite example is the tarpit for SQLite on their wiki.
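A bare-bones sketch of such a tarpit, as a CGI sitting behind a hidden link that robots.txt forbids (the penalty file name and format are made up; a companion script would read it and enforce the penalty period):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # record the offender for the penalty period...
    my $ip = $ENV{REMOTE_ADDR} || 'unknown';
    open my $pit, '>>', '/tmp/tarpit.ips' or die "can't record offender: $!";
    print {$pit} join(' ', time, $ip), "\n";
    close $pit;

    # ...and answer very, very slowly
    $| = 1;
    print "Content-Type: text/plain\r\n\r\n";
    for (1 .. 60) {
        print '.';
        sleep 5;
    }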
Re: blocking site scrapers
by Cody Pendant (Prior) on Feb 07, 2006 at 04:27 UTC
Re: blocking site scrapers
by monarch (Priest) on Feb 07, 2006 at 06:22 UTC
Seeing as it is a photo site, is adding a small CAPTCHA image at login going to be a problem?
Re: blocking site scrapers
by lima1 (Curate) on Feb 07, 2006 at 19:19 UTC
Maybe you could run some tests if you can't use the Apache modules mentioned above:
# rough scoring sketch; yoursite.com, the js cookie, and ip_was_quiet() are placeholders
my $points = 0;
# What's the referer? Did the user come from your own homepage? Good sign.
$points++ if ($ENV{HTTP_REFERER} || '') =~ m{^\Qhttp://www.yoursite.com/\E};
# Is the user agent something you know (Internet Exploder, Gecko, Opera, Google)?
$points++ if ($ENV{HTTP_USER_AGENT} || '') =~ /MSIE|Gecko|Opera|Googlebot/i;
# Does the client run JavaScript? Most spambots don't (e.g. check for a JS-set cookie).
$points++ if ($ENV{HTTP_COOKIE} || '') =~ /\bjs=1\b/;
# Did this IP request no other page in the last x seconds?
$points++ if ip_was_quiet($ENV{REMOTE_ADDR});
if ($points > 1) {
    show_gallery();
}
Re: blocking site scrapers
by Xenograg (Scribe) on Feb 07, 2006 at 17:40 UTC
Nice idea, but referrers can be forged as well.
Yes, they can, but if someone's scraping the site, they'd have been referred by the site in question to get to the image.
Checking HTTP_REFERER is for those cases when someone from another website decides to link directly to an image (and/or a page) on your site. Back in the early days of HTTP (i.e., 0.9, before there was such a thing as HTTP_REFERER), it was common for people to link to the imagemap and counter CGIs that ran on the server I maintained; they didn't care, and there was no real way to stop them.
Likewise, people would find an image they liked (a bullet, an animated GIF, whatever) and link directly to it, sucking down your bandwidth. (The university where I worked had only a T1 in 1994.)
These days, when people check HTTP_REFERER, it's not to stop bots; it's to stop other sites from linking directly to your images so that their visitors burn your bandwidth. Since the hotlinkers have no control over their visitors' browsers, checking HTTP_REFERER can be a very effective way to cut down on that abuse. However, as not all browsers send HTTP_REFERER, you have to make sure the null case is to allow the download.
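A minimal CGI-level sketch of that check (the domain is a placeholder; note that an empty referer is allowed through):

    my $referer = $ENV{HTTP_REFERER} || '';
    my $ok = $referer eq ''                                     # null case: many clients send nothing
          || $referer =~ m{^https?://(www\.)?yoursite\.com/}i;  # or the request came from your own pages

    unless ($ok) {
        print "Status: 403 Forbidden\r\nContent-Type: text/plain\r\n\r\nNo hotlinking, please.\n";
        exit;
    }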
...
I'm also surprised that no one has mentioned checking X_FORWARDED_FOR to detect proxies (which would have identified the issue with AOL, as well as Squid and quite a few other proxies). There were also some proposals floating about for extending the robot exclusion standard to specify rate limiting and visiting hours, but it's been a decade and I've never seen any widespread support for them.
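For completeness, the proxy check is a one-liner at the CGI level:

    # clients coming through well-behaved proxies (AOL, Squid, ...) usually carry this header
    if (my $xff = $ENV{HTTP_X_FORWARDED_FOR}) {
        warn "request relayed by a proxy on behalf of: $xff\n";
    }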
Re: blocking site scrapers
by mikeraz (Friar) on Feb 08, 2006 at 22:31 UTC
If you own your server and have full access to it …
You can insert firewall rules to block access from the offending IP for a set time.
Inspired by a presentation on spam handling by merlyn, I wrote a program that monitors log files for unwanted activity and locks out an IP for five minutes when offending activity is detected. This alone cut the number of spams delivered to my system for processing from ~30,000 a day to fewer than 1,000.
It continues to be a great learning exercise that I'll hopefully polish into something real eventually.
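Not the program described above, but a rough sketch of the same idea, assuming root, iptables, and a common-format log (the log path and the "offending activity" pattern are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $log = '/var/log/apache/access.log';
    open my $fh, '<', $log or die "can't open $log: $!";
    seek $fh, 0, 2;                               # start at the end, like tail -f

    my %blocked;                                  # ip => epoch time the block was added
    while (1) {
        while (my $line = <$fh>) {
            next unless $line =~ m{GET /cgi-bin/formmail}i;    # your idea of "unwanted activity"
            my ($ip) = $line =~ /^(\S+)/;
            next if !$ip or $blocked{$ip};
            system 'iptables', '-I', 'INPUT', '-s', $ip, '-j', 'DROP';
            $blocked{$ip} = time;
        }
        for my $ip (keys %blocked) {              # lift blocks older than five minutes
            next if time - $blocked{$ip} < 300;
            system 'iptables', '-D', 'INPUT', '-s', $ip, '-j', 'DROP';
            delete $blocked{$ip};
        }
        sleep 5;
        seek $fh, 0, 1;                           # clear EOF so new lines are picked up
    }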
Be Appropriate && Follow Your Curiosity