Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Benevolent Ad Filter

by httptech (Chaplain)
on Jun 08, 2000 at 01:28 UTC ( [id://16964]=CUFP: print w/replies, xml ) Need Help??

There are all kinds of methods for blocking banner ads. However, I feel guilty about blocking ads from the smaller sites that depend on the advertising income to stay afloat. I'm getting the benefit of the site, but the sponsors don't see any hits on their ads, so the site operator gets short-changed by my filter.

I've used Perl to rectify this. I have set up Apache on my local machine, and enabled mod_rewrite with a rule like this in my httpd.conf:

RewriteEngine on RewriteRule ^/.*$ /blank.cgi
Now, in /etc/hosts, I have a list of servers who only serve ad content, similar to this:
127.0.0.1 adfu.blockstackers.com 127.0.0.1 ad.doubleclick.net 127.0.0.1 m.doubleclick.net
Now, when a request for any URL at one of the above servers is made by my browser, it is diverted to my local http server, and my custom CGI script. The CGI script makes a connection to the ad server, figures out whether the content is supposed to be text or an image, and then generates a blank page or transparent gif for my browser to use in place of the ad. It closes the connection before actually downloading the ad.

The net result: I don't have to waste bandwidth on banner ads, and the site operator still has a "GET" request for the ad in their logs, so they still get paid (assuming they get paid for page views instead of click-throughs)

I am posting the code as a reply to this message.

Replies are listed 'Best First'.
RE: Benevolent Ad Filter
by httptech (Chaplain) on Jun 08, 2000 at 01:30 UTC
    #!/usr/bin/perl use CGI::Carp qw(fatalsToBrowser); use Net::DNS; use Socket; my $url = my $path = $ENV{'REQUEST_URI'}; my $host = $ENV{'HTTP_HOST'}; my $referrer = $ENV{'HTTP_REFERER'}; my $useragent = $ENV{'USER_AGENT'}; my $content_type; $path =~ s#/([^/])*$##; GETCONTENT: $content_type = &get_content_type; &print_gif if $content_type =~ /image/; &print_text; exit; sub print_text { print "Content-type: $content_type\n\n<!-- -->\n"; ex +it } sub print_gif { # Prints a transparent GIF. Adapted from code created by Turnstep: # http://www.perlmonks.org/index.pl?node_id=7974 print "Content-Length: 43\nContent-type: image/gif\n\n"; printf "GIF89a\1\0\1\0%c\0\0%c%c%c\0\0\0%s,\0\0\0\0\1\0\1\0\0%c%c% +c\1\0;", 144,0,0,0,1?pack("c8",33,249,4,5,16,0,0,0):"",2,2,4; exit; } sub nslookup { my $host = shift; my $dns = new Net::DNS::Resolver; my $query = $dns->search($host); if ($query) { for $rr ($query->answer) { next unless $rr->type eq "A"; return $rr->address; } } else { die "lookup of $host failed: ", $dns->errorstring, "\n"; } } sub get_content_type { my ($type, $location); my $addr = &nslookup($host); my $proto = getprotobyname('tcp'); my $inet = inet_aton($addr); socket(S,PF_INET,SOCK_STREAM,$proto) || die "Couldn't open socket: + $!"; if (connect(S,pack "SnA4x8",2,80,$inet)) { select (S); $| = 1; print "GET $url HTTP/1.0\r\n"; print "Referer: $referrer\r\n"; print "User-Agent: $useragent\r\n\r\n"; while (<S>) { if (/^Content-type:\s*(.*)(?:\;|\n)/i) { $type = $1 } if (/^Location:\s*(.*)$/i) { $location = $1 } last if /^\s*$/; # Close the connection after getting the +headers } select (STDOUT); close S; } else { die "Couldn't connect to $host : $!" } if ($location) { # We've been redirected if ($location =~ m#^http://([^/]*)/(.*)$#) { $host = $1; $url = "/$2"; } else { $url = ($location =~ /^$path/o) ? $location : "$path/$loca +tion"; } goto GETCONTENT; } return $type; }
      This is somewhat related: I recently installed FilterProxy, a Proxy http server written in Perl of course. What's quite nice about it is that one can point one's web browser at the proxy port and configure all the regexes used, etc. *excellent* program if you don't want to install apache to use this one. The only real advantage to punk music is that nobody can whistle it.
        Actually you really don't need to install Apache to use this. You could use Miniweb, the standalone single-CGI script http server that runs from inetd.
RE: Benevolent Ad Filter
by httptech (Chaplain) on Jun 09, 2000 at 02:31 UTC
    By the way, here's the complete list of blocked sites from my /etc/hosts:
    127.0.0.1 adfu.blockstackers.com 127.0.0.1 ad.doubleclick.net 127.0.0.1 m.doubleclick.net 127.0.0.1 ad.webprovider.com 127.0.0.1 image.linkexchange.com 127.0.0.1 jeeves.flycast.com 127.0.0.1 www.flycast.com 127.0.0.1 www.burstmedia.com 127.0.0.1 www.247media.com 127.0.0.1 www.ad-venture.com 127.0.0.1 www.adauction.com 127.0.0.1 www.adsdaq.com 127.0.0.1 a32.g.a.yimg.com 127.0.0.1 www.pagecount.com 127.0.0.1 www1.pagecount.com 127.0.0.1 www2.pagecount.com 127.0.0.1 www3.pagecount.com 127.0.0.1 www4.pagecount.com 127.0.0.1 ad.linkexchange.com.com 127.0.0.1 www.smartclicks.com 127.0.0.1 mojofarm.mediaplex.com 127.0.0.1 www.etour.com 127.0.0.1 netadsrv.iworld.com
    If you come across any other "filter-worthy" servers, let me know!

      At www.waldherr.org, there is a blocklist for the Internet Junkbuster, another blocking proxy. I guess that if you just take the top part, where servers are blocked completely, you will get a good list of servers-to-be-blocked :). Of course, the IJB does a bit more, it does regex-based blocking of URLs, but we can always add this here as well ;).

RE: Benevolent Ad Filter
by merlyn (Sage) on Jun 08, 2000 at 05:03 UTC
      Interesting. But it looks like there are some important differences:

      The mod_perl version proxies everything, not just ad servers. However it only blocks images; sometimes ads come in the form of javascript or even java. But they usually get sent from the same server for tracking purposes, so my script will block all ad content from a given server. (You could probably alter the mod_perl version to do this though)

      The mod_perl version actually retrieves the entire file it blocks, which I think is a waste of bandwidth, but you're forced into that if you use LWP (as far as I know). That's why I use the Socket module, and close the connection as soon as I have the headers. The trade-off for this is my version will not work through another proxy server.

        You can, in fact, use LWP to load just the first part of the GET request, by using a content-callback handler that throws an exception, cutting off any further action. Quoting from perldoc LWP::UserAgent:
        The request can be aborted by calling die() in the call- back routine. The die message will be available as the "X-Died" special response header field.

        -- Randal L. Schwartz, Perl hacker

        and close the connection as soon as I have the headers

        #!/usr/bin/perl use LWP::Simple; if (head('http://www.foo.com/')) { print "Page exists and would download fine!\n"; }
        In list context, head returns all kinds of interesting values such as response code, last modified, content-length etc.
        The reason for not using HEAD is simple; it shows up in the logs as a HEAD request and not a GET. If I were a advertiser I would be not count HEAD requests as legitimate page views, since its clear that the ad was never actually viewed.
RE: Benevolent Ad Filter
by swiftone (Curate) on Jun 08, 2000 at 21:57 UTC
    My network admin noticed that my machine was sucking up huge amounts of bandwidth (due to leaving one or two netscape windows on PerlMonks all night/over the weekend). Thus I have installed Squid and am using it as a proxy to block all ad banners, but now I feel guilty over costing PerlMonks and other sites their ad revenue. This script should work fine as a Squid Redirect script (No apache modifications necessary).

    On another note, someone who didn't want to screw up their Apache server could set up a Virtual host to do this.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://16964]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2024-03-28 13:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found