Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a users-online script that records users' IP addresses, and I have an image gallery site. Lately I've had a crapload of kids with bots scraping my site and taking all my images, sucking the bandwidth out from under me.

I suppose I could block IP ranges every 5 seconds, but the problem is I don't know much about IPs. I know they can be altered, but here's the thing: I looked at my log while being scraped, and the first three sets (octets) of the IP are the same while the last one changes.

Does the last set determine if it's the same person or not? Or is it guaranteed to be the same person?

I.e.

64.12.116.67 64.12.116.202 64.12.116.201
These are the real IP addresses. I think this is the same person, but I'm not 100% sure. They are online at the same time.

So can I create a Perl script to filter out IPs that match on the first three sets, or does the last set matter too?

Sorry, this is more of a general technical question, but I need to figure something out before I use up all my bandwidth and my host beats me up.

Also, there's no need to reiterate that IP addresses can be spoofed and that there's no 100% reliable way to block them. I know that.

Re: blocking site scrapers
by mirod (Canon) on Feb 07, 2006 at 08:25 UTC
      If I am using a web host, could I install mod_throttle on my own directory? Or is that specifically for root?

        If mod_throttle is not already installed, then no, you can't use it. That's where you turn to the second link I gave above ;--)

      There's also mod_bwshare, though I haven't used it myself.
Re: blocking site scrapers
by chargrill (Parson) on Feb 07, 2006 at 04:33 UTC

    AOL user(s):

    $ whois 64.12.116.201
    OrgName:      America Online, Inc.
    OrgID:        AMERIC-158
    Address:      10600 Infantry Ridge Road
    City:         Manassas
    StateProv:    VA
    PostalCode:   20109
    Country:      US
    NetRange:     64.12.0.0 - 64.12.255.255
    CIDR:         64.12.0.0/16
    NetName:      AOL-MTC
    NetHandle:    NET-64-12-0-0-1
    Parent:       NET-64-0-0-0-0
    NetType:      Direct Assignment
    NameServer:   DNS-01.NS.AOL.COM
    NameServer:   DNS-02.NS.AOL.COM
    Comment:
    RegDate:      1999-12-13
    Updated:      1999-12-16
    RTechHandle:  AOL-NOC-ARIN
    RTechName:    America Online, Inc.
    RTechPhone:   +1-703-265-4670
    RTechEmail:   domains@aol.net

    But here's your problem: AOL users route through constantly rotating proxies, so that's PROBABLY the same user. HOWEVER, there's no guarantee that they'll come from 64.12.116.x the next time they decide to scrape your site.

    Given that they're likely coming from AOL, no, I doubt they can spoof their source IP.

    Do notice, however, that it's a /16 - you could technically block THAT entire range, but it might limit your audience more than you'd like.
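
    A minimal sketch of such a range check, assuming Net::Netmask is available (the /16 comes from the whois output above; the 403 response is just one possible policy):

        use strict;
        use warnings;
        use Net::Netmask;

        # The AOL proxy range from the whois output above.
        my $aol = Net::Netmask->new('64.12.0.0/16');

        my $ip = $ENV{REMOTE_ADDR} || '';
        if ($aol->match($ip)) {
            # Deny, delay, or just log -- whatever policy you settle on.
            print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\nSorry.\n";
            exit;
        }

    match() only tells you the address falls inside the range; whether you block, delay, or merely log it is a separate decision.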



    --chargrill
    $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );
      I'm not going to "block" anyone. My idea is to set up a script that will kill the request if the refresh is too quick. I'd record each IP in a database along with a timestamp of when it was last seen. If that IP tries to reload a page within X seconds, the rest of the page won't load for 5 seconds. This should cut back on bots and may even get them to stop, hopefully.

      But this would filter out search engine bots, too. So I'm stuck :(
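
      A rough sketch of that idea, assuming a CGI environment and using a tied DBM file as the "database" (the file name and thresholds are made up):

          use strict;
          use warnings;
          use Fcntl;
          use SDBM_File;

          my $min_interval = 3;   # seconds allowed between requests
          my $penalty      = 5;   # how long to stall a too-fast client

          tie my %last_seen, 'SDBM_File', '/tmp/last_seen', O_RDWR|O_CREAT, 0644
              or die "Can't tie DBM file: $!";

          my $ip  = $ENV{REMOTE_ADDR} || 'unknown';
          my $now = time;

          if (exists $last_seen{$ip} && $now - $last_seen{$ip} < $min_interval) {
              sleep $penalty;     # stall instead of blocking outright
          }
          $last_seen{$ip} = $now;
          untie %last_seen;

          # ... then serve the page or image as usual ...

      Note that sleep() in a CGI keeps a server process busy, so an aggressive bot still ties up resources; it just gets far fewer pages per minute.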

        Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)

        A well-behaved search engine bot SHOULD be discernible by its UA (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has requested, /robots.txt...

        Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...



        --chargrill
        $/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } + sig map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu +" );

        You could start to build a second database (or add a field in the present one) that would include IP numbers that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.

        My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable: there's a bot out there that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. It's called WebVulnScan or WebVulnCrawl. That's just plain rude.

        But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?
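
        A rough sketch of that bookkeeping, built from the access log rather than live requests (the log path, log format, and UA list are all assumptions; adjust for your setup):

            use strict;
            use warnings;

            # Build a whitelist of IPs that either fetched /robots.txt or sent a
            # User-Agent matching a crawler we consider reputable.
            my @good_bots = qw(Googlebot msnbot Yahoo Teoma RufusBot netcraft MMCrawler);
            my $bot_re    = join '|', map { quotemeta } @good_bots;

            my %whitelist;
            open my $log, '<', '/var/log/apache/access.log' or die "open: $!";
            while (my $line = <$log>) {
                my ($ip) = $line =~ /^(\S+)/ or next;
                $whitelist{$ip} = 1 if $line =~ m{GET /robots\.txt} || $line =~ /$bot_re/o;
            }
            close $log;

            # Dump it somewhere the gallery script can read back cheaply.
            open my $out, '>', '/tmp/bot_whitelist.txt' or die "open: $!";
            print {$out} "$_\n" for sort keys %whitelist;
            close $out;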

        If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
        Blocking Greedy Clients
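
        For flavour, a minimal sketch of one way to do this under mod_perl 1 (my own simplification, not the book's code; a real setup needs a store shared across Apache children, such as a DBM file or shared memory, rather than a per-process hash):

            package My::SpeedLimit;
            use strict;
            use Apache::Constants qw(:common);

            my %last;   # per-child only -- use a shared store in real life

            sub handler {
                my $r   = shift;
                my $ip  = $r->connection->remote_ip;
                my $now = time;

                if (exists $last{$ip} && $now - $last{$ip} < 2) {
                    return FORBIDDEN;       # too fast, refuse this request
                }
                $last{$ip} = $now;
                return OK;
            }
            1;

        You would then point something like "PerlAccessHandler My::SpeedLimit" at the gallery's <Location> or <Directory> block in httpd.conf.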
Re: blocking site scrapers
by spiritway (Vicar) on Feb 07, 2006 at 03:07 UTC

    I've got the same problem, except that I also get attempted exploits (infected servers). It seems that your IP numbers all start out the same, and only the last octet varies. The chances are that the IP numbers are being pulled from a pool as needed, and then returned when not needed any more. This means that your miscreant might have different IP numbers each time (s)he signs on. Thus, I would block the whole range of IP numbers, 64.12.116.0 through 64.12.116.255. You run a slight risk of blocking out an innocent party, but it's probably worth it to save your bandwidth.
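
    If you take that route, a no-modules sketch of checking the first three octets against a blocklist (the prefixes here are only examples):

        use strict;
        use warnings;

        # /24 prefixes (first three octets) you have decided to block.
        my %blocked_prefix = map { $_ => 1 } ('64.12.116', '64.12.117');

        my $ip = $ENV{REMOTE_ADDR} || '';
        my ($prefix) = $ip =~ /^(\d+\.\d+\.\d+)\.\d+$/;

        if (defined $prefix && $blocked_prefix{$prefix}) {
            print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\nGo away.\n";
            exit;
        }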

    You may also want to use whois or traceroute (or tracert if you're using Windows) to find out whose server it is. You can get the owner's contact information from whois, and notify them of the problem. They may take steps to stop this from happening, depending on how ethical they are.

    As for spoofing IP numbers, it's not likely that someone would go to that much bother, just to steal some photos. And I'm wondering why they have to keep scraping, if they've already got them. Sounds kind of dumb to me - or else, they've written a very rude bot.

Re: blocking site scrapers
by DrHyde (Prior) on Feb 07, 2006 at 10:41 UTC
    The technique I talked about in Re: Re: Password hacker killer is probably going to be useful. I imagine you'll also need to have things decay out of your block list. Or rather, as you don't want to inconvenience real users too much, just skr1pt kiddies, I'd make it a delay list - if someone's in the list, make their downloads sloooooow.

    As someone else has mentioned, those particular addresses are AOL proxies, which indicates that you may want to score whole address ranges instead of individual addresses. This snippet from one of the scripts I use when hunting spammers will help.

    use Net::DNS;

    my $IP = '64.12.116.67';
    my ($ASN, $network, $network_bits) = @{
        Net::DNS::Resolver->new()
            ->query( join('.', reverse(split(/\./, $IP))) . ".asn.routeviews.org", "TXT", "IN" )
            ->{answer}->[0]->{char_str_list}
    };
    print "$network/$network_bits\n";
Re: blocking site scrapers
by pboin (Deacon) on Feb 07, 2006 at 14:07 UTC

    There are a lot of things to think about here, as other monks have well noted. One thing I could add for you to think about: you could also inadvertently block a router that's doing NAT for an entire organization. Everyone behind that router would appear to come from the same IP address in your logs. You may end up deciding that blocking a whole organization is OK, but at least consider what you're dealing with.

    One of the more clever ways to stop robots, IMO, is to have a tarpit link or picture that triggers a penalty period. Bots are dumb, and they'll fall for it every time, unless a human codes around your particular tarpit.

    My favorite example is the tarpit for SQLite on their wiki.
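
    A tarpit can be as simple as a link real users never see (hidden via CSS and disallowed in robots.txt) pointing at a script like the following sketch; the paths are made up:

        #!/usr/bin/perl
        # trap.cgi -- anything that follows the hidden link ends up here.
        use strict;
        use warnings;

        my $ip = $ENV{REMOTE_ADDR} || 'unknown';

        # Record the offender; the gallery script can consult this file
        # and slow down or refuse anyone listed in it.
        open my $fh, '>>', '/tmp/tarpit_offenders.log' or die "open: $!";
        print {$fh} join("\t", scalar localtime, $ip, $ENV{HTTP_USER_AGENT} || ''), "\n";
        close $fh;

        # Stall the bot a little while we're at it.
        sleep 20;
        print "Content-type: text/html\r\n\r\n<html><body>Nothing to see here.</body></html>\n";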

Re: blocking site scrapers
by Cody Pendant (Prior) on Feb 07, 2006 at 04:27 UTC
Re: blocking site scrapers
by monarch (Priest) on Feb 07, 2006 at 06:22 UTC
    Seeing as it's a photo site, would adding a small CAPTCHA image at login be a problem?
Re: blocking site scrapers
by lima1 (Curate) on Feb 07, 2006 at 19:19 UTC
    Maybe you could run some tests of your own if you can't use the Apache modules mentioned above (a fuller sketch follows below):

    $points = 0;

    What's the referer? Did the user come from your own homepage? Good sign: $points++

    Is the user agent something you know (Internet Exploder, Gecko, Opera, Google)? $points++

    Does the user agent support JavaScript (most spambots don't)? $points++

    Has this IP requested no page in the last X seconds? $points++

    if ($points > 1) { show_gallery(); }
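
    Pulled together as a single untested CGI fragment (the domain, thresholds, patterns, and the two helper subs are placeholders):

        use strict;
        use warnings;

        my $points  = 0;
        my $referer = $ENV{HTTP_REFERER}    || '';
        my $ua      = $ENV{HTTP_USER_AGENT} || '';
        my $ip      = $ENV{REMOTE_ADDR}     || '';

        # Came here from one of our own pages?  Good sign.
        $points++ if $referer =~ m{^https?://(www\.)?example\.com}i;   # your domain here

        # A user agent we recognise (browser or known search bot)?
        $points++ if $ua =~ /MSIE|Gecko|Opera|Googlebot/i;

        # JavaScript support has to be probed on an earlier page (e.g. a
        # JS-set cookie) and passed along -- assumed here as a cookie.
        $points++ if ($ENV{HTTP_COOKIE} || '') =~ /\bjs=1\b/;

        # No request from this IP in the last X seconds?
        $points++ if not_recently_seen($ip);

        if ($points > 1) {
            show_gallery();
        } else {
            print "Status: 403 Forbidden\r\n\r\n";
        }

        # Placeholder: wire this up to the timestamp store discussed earlier.
        sub not_recently_seen { return 1 }

        # Placeholder for the real gallery code.
        sub show_gallery { print "Content-type: text/html\r\n\r\n<html>gallery goes here</html>\n" }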
Re: blocking site scrapers
by Xenograg (Scribe) on Feb 07, 2006 at 17:40 UTC
    If your site is served by Apache, use mod_rewrite directly to filter on the referrer. I use it on my personal sites to block access to images except for requests from my own domains.

    --- The harder I work, the luckier I get.

      Nice idea, but referrers can be forged as well.

      BMaximus

        Yes, they can -- but if someone's scraping the site, they'd have been referred by the site in question to get to the image.

        Checking HTTP_REFERER is for those cases when someone from another website decides to link directly to an image (and/or page) on your site. Back in the early days of HTTP (i.e., 0.9, before there was such a thing as HTTP_REFERER), it was common for people to link to our imagemap and counter CGIs that ran on the server that I maintained -- they didn't care, and there was no real way to stop them.

        Likewise, people would find an image they liked (a bullet, some animated gif, whatever), and would link directly to it, sucking down your bandwidth. (the university where I worked only had a T1 in 1994)

        These days, however, when people check HTTP_REFERER, it's not to stop bots -- it's to stop people from linking directly to the images, so that other people visiting their site use someone else's bandwidth. As they don't have control over the other people's browsers, checking HTTP_REFERER can be a very effective way to cut down on abuse -- however, as not all browsers send HTTP_REFERER, you have to make sure that the null case is to allow the download.
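
        In a plain CGI, that null-case rule looks roughly like this (substitute your own domain for the example one):

            use strict;
            use warnings;

            my $referer = $ENV{HTTP_REFERER} || '';

            # Allow an empty referer (many browsers and proxies strip it) and
            # anything referred from our own pages; refuse the rest.
            my $ok = $referer eq ''
                  || $referer =~ m{^https?://(www\.)?example\.com/}i;   # your domain here

            unless ($ok) {
                print "Status: 403 Forbidden\r\nContent-type: text/plain\r\n\r\n",
                      "Please link to the page, not directly to the image.\n";
                exit;
            }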

        ...

        I'm also surprised that no one has mentioned checking X_FORWARDED_FOR to detect proxies (which would have identified the issue with AOL, as well as Squid and quite a few other proxies). There were also some proposals floating about for changing the robot exclusion standards to specify rate limiting and visiting hours, but it's been a decade and I've never seen any widespread support for them.
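
        Reading that header from a CGI is easy enough; just remember it is client-supplied and may carry a comma-separated chain of addresses:

            use strict;
            use warnings;

            # X-Forwarded-For may be absent, forged, or a chain like
            # "client, proxy1, proxy2" -- treat it as a hint, not proof.
            my $xff    = $ENV{HTTP_X_FORWARDED_FOR} || '';
            my $remote = $ENV{REMOTE_ADDR}          || '';
            my @chain  = grep { length } map { my $a = $_; $a =~ s/^\s+|\s+$//g; $a } split /,/, $xff;
            my $claimed_client = @chain ? $chain[0] : $remote;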

Re: blocking site scrapers
by mikeraz (Friar) on Feb 08, 2006 at 22:31 UTC

    If you own your server and have full access to it …
    You can insert firewall rules to block access from the offending IP for a set time.

    Inspired by a presentation on spam handling by merlyn, I wrote up a program that monitors log files for unwanted activity and locks out an IP for five minutes when offending activity is detected. This alone cut the number of spams delivered to my system for processing from ~30,000 a day to fewer than 1,000.

    It continues to be a great learning exercise that I'll hopefully polish into something real eventually.
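
    The skeleton of that kind of watcher, assuming a Linux box with iptables and a log you can tail (the path, thresholds, and timings are all made up):

        #!/usr/bin/perl
        # Watch an access log and temporarily firewall IPs that hammer the site.
        use strict;
        use warnings;

        my $log       = '/var/log/apache/access.log';   # adjust to your setup
        my $max_hits  = 60;      # requests allowed ...
        my $window    = 60;      # ... per this many seconds
        my $block_for = 300;     # how long to keep the firewall rule

        my (%hits, %blocked);    # ip => [timestamps], ip => time blocked

        open my $fh, '<', $log or die "Can't open $log: $!";
        seek $fh, 0, 2;          # skip history, start tailing at the end

        while (1) {
            while (my $line = <$fh>) {
                my ($ip) = $line =~ /^(\d+\.\d+\.\d+\.\d+)\s/ or next;
                next if $blocked{$ip};
                push @{ $hits{$ip} }, time;
                @{ $hits{$ip} } = grep { time - $_ <= $window } @{ $hits{$ip} };
                if (@{ $hits{$ip} } > $max_hits) {
                    system 'iptables', '-I', 'INPUT', '-s', $ip, '-j', 'DROP';
                    $blocked{$ip} = time;
                }
            }
            for my $ip (keys %blocked) {           # expire old blocks
                next if time - $blocked{$ip} < $block_for;
                system 'iptables', '-D', 'INPUT', '-s', $ip, '-j', 'DROP';
                delete $blocked{$ip};
            }
            sleep 5;
            seek $fh, 0, 1;                        # clear the EOF flag, keep tailing
        }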

    Be Appropriate && Follow Your Curiosity