This snippet fetches a spam blacklist from 'news.admin.net-abuse.sightings', as archived on Google Groups.
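#!/usr/bin/perl
use strict;
use warnings;
use WWW::Google::Groups;

my $agent = WWW::Google::Groups->new(server => 'http://groups.google.com');
my $group = $agent->select_group('news.admin.net-abuse.sightings');

# Stop once this many addresses have been collected (default: 20).
my $threshold = $ARGV[0] || 20;
my $cnt       = 0;

open my $blacklist,     '>', 'blacklist'     or die "Cannot open blacklist: $!";
open my $blacklist_log, '>', 'blacklist_log' or die "Cannot open blacklist_log: $!";

while ( my $thread = $group->next_thread() ) {
    while ( my $article = $thread->next_article() ) {
        my $body = $article->body();

        # Find the first From: line quoted in the sighting report.
        if ( $body =~ /^(From: .+)$/m ) {
            my $from = $1;

            # Pull out the address, with or without angle brackets.
            if ( $from =~ /<?([^\s<>]+@[^\s<>]+)>?/ ) {
                print {$blacklist} $1, $/;
                print {$blacklist_log} join( q/ /, $thread->title(), '=>', $1 ), $/;
                $cnt++;
            }
        }
        last;    # only the first article of each thread is examined
    }
    last if $cnt >= $threshold;
}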

Re: get spam email list from google
by xern (Beadle) on Jan 30, 2004 at 07:18 UTC
    I have also put up a blacklist that is constantly being updated, in the hope that it saves you the time of fetching the list from Google. It is now available here

      Thanks for putting that link up, it saved me from having to run the code to go and have a look. Here's my point of view:

      That list is pretty worthless. There is a whole pile of addresses that I assume are spoofed freemailers. All those yahoo.com addresses for a start, and there are a large number of domains run by Outblaze in there as well. There are better ways of dealing with that.

      There are also a few MAILER-DAEMON addresses. Block them, and you're going to lose the ability to receive bounces from those domains. If you have mail routing problems you won't ever know.

      The final futility of blocking based on the sender is that most spam engines simply take a valid domain name, and generate 4-16 random characters for the left hand side. (Hmm, there's this language I've heard about that would be ideally suited to that kind of task :) Just because someone else saw (and reported) an address, doesn't mean you'll ever see it. Quite the opposite in fact. They are many things, but spammers are not stupid.
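To make the point about throwaway senders concrete, here is a minimal sketch of how little code it takes to produce a fresh address for every message. The helper name `random_sender` and the domain `example.com` are made up for illustration; this is not taken from any actual spam tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a sender address with a random
# 4-16 character local part on the given domain.
sub random_sender {
    my ($domain) = @_;
    my @chars = ( 'a' .. 'z', '0' .. '9' );
    my $len   = 4 + int rand 13;    # 4..16 inclusive
    my $local = join '', map { $chars[ rand @chars ] } 1 .. $len;
    return "$local\@$domain";
}

# Every call yields a different address, so a list of previously
# reported senders will almost never match the next one.
print random_sender('example.com'), "\n" for 1 .. 3;
```

With 36 possible characters per position, even the shortest local parts give over a million combinations, which is why a list of sighted addresses goes stale immediately.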

      If you are serious about spam, you don't care about the sender. Forging the envelope sender is trivial. Much more interesting is to get the IP address of the host that sent you the garbage. Don't shoot the message, shoot the messenger!
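As a rough sketch of what "getting the IP address" can look like, the code below pulls the bracketed IPv4 address out of the topmost Received: header, which is the one normally added by your own mail server and so the hardest to forge. The helper name `relay_ip` and the sample headers are assumptions for illustration; real Received: formats vary between mail servers, so treat the regex as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bracketed IPv4 address found in the
# first Received: header, or nothing if there is none.
sub relay_ip {
    my ($headers) = @_;

    # Unfold continuation lines so each header sits on one line.
    ( my $unfolded = $headers ) =~ s/\r?\n[ \t]+/ /g;

    for my $line ( split /\r?\n/, $unfolded ) {
        next unless $line =~ /^Received:/i;
        return $1 if $line =~ /\[(\d{1,3}(?:\.\d{1,3}){3})\]/;
    }
    return;
}

# Made-up sample headers using documentation-only addresses.
my $sample = join "\n",
    'Received: from mail.example.net (unknown [192.0.2.45])',
    "\tby mx.example.org with SMTP; Fri, 30 Jan 2004 07:18:00 +0000",
    'From: <abc123@example.com>';

print relay_ip($sample), "\n";    # prints 192.0.2.45
```

The resulting address is what you would check against an IP-based blocklist, rather than matching on the From: line.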

        I fully agree with your post. Blocking spam based on the "From" header is next to useless, since most spam is spoofed with randomly created addresses. Besides that, you might also block legitimate mail from unlucky people whose e-mail address got hijacked (read: (ab)used as the From header in spam runs). Spammers might run out of luck with this approach, according to this great story, but still ;)

        --
        b10m

        All code is usually tested, but rarely trusted.
        I see that. I am considering whether a public spam database could be set up, offering subject lines, message contents, and other available information to people who are interested in doing data-mining research on fighting spam.