Email filtering

This is a cool "want to use", as I haven't done it yet. However, I spent $199 on a new Linux server, so I'm committed :)

Basically, I think that a lot of the features that spammers use to defeat simple keyword filters can themselves be recognised by something a little smarter, like a Perl script.

I kind of like SpamCop, but it has issues with my current mail provider, and doesn't quite do what I want in terms of filtering.

So, I want to write some filters of my own, to "detect" spam based on the incoming mail, and to handle the "challenge" to see if a live person is on the other end. All in Perl, on a Linux box.

So, besides any pointers from those who have already done something like this, I have a semi-off-topic request: can anyone point me in the right direction for good info on this kind of Linux server setup? Specifically, how to properly run the mail server (web server, etc.) and the firewall on the same physical box. And, make this a killer Perl machine in general :)

So this isn't totally OT, let me list some of the filter ideas that relate to regular expressions. On newsgroups, I see words that separate each letter with a non-letter mark, like "g.e.t. r.i.c.h!". Kills keyword filters, but easy to spot in and of itself, with real matching logic (i.e. Perl). Likewise, to get around duplicate-subject detection they add a serial number to the subject, and they always put a bunch of spaces in front of it (to scroll it off small screens?), which is trivial to spot. Anything with more !'s than words can bite the dust, since I probably wouldn't want to read it even if it wasn't spam.
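
The three heuristics above might be sketched roughly as follows. This is only an illustration, not tested against real spam: the function names, the "at least three letters" rule, and the "five or more spaces" cutoff are all assumptions of mine.

```perl
use strict;
use warnings;

# Heuristic 1: single letters joined by non-letter marks, e.g. "g.e.t" or
# "r-i-c-h". Assumes at least three letters are needed to count as a hit.
sub has_spaced_out_word {
    my ($text) = @_;
    return $text =~ /(?<![A-Za-z])[A-Za-z](?:[^A-Za-z\s][A-Za-z]){2,}(?![A-Za-z])/ ? 1 : 0;
}

# Heuristic 2: a subject ending in a long run of whitespace followed by one
# trailing token (the serial number). The cutoff of 5 is an assumption.
sub subject_has_padded_serial {
    my ($subject) = @_;
    return $subject =~ /\s{5,}\S+\s*$/ ? 1 : 0;
}

# Heuristic 3: more exclamation marks than words.
sub more_bangs_than_words {
    my ($text) = @_;
    my $bangs = ($text =~ tr/!//);          # tr counts without modifying
    my $words = () = $text =~ /\w+/g;       # count of word-like runs
    return $bangs > $words ? 1 : 0;
}
```

The first pattern will also fire on things like "U.S.A" or a short email address, so it probably belongs in a scoring scheme rather than being an outright reject rule.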

Re: Email filtering by gav^ (Curate) on Apr 15, 2002 at 20:11 UTC
There have been some good articles on perl.com:

Stopping Spam with SpamAssassin: http://www.perl.com/pub/a/2002/03/06/spam.html
My Life With Spam: http://www.perl.com/pub/a/2000/02/spamfilter.html

There is also a plethora of modules in the Mail:: namespace that might be helpful.

Hope this helps...

gav^
Re: Re: Email filtering by perrin (Chancellor) on Apr 15, 2002 at 20:34 UTC
SpamAssassin++. I started using it after Matts suggested it on the mod_perl list. It's been working great for me.
Re: Email filtering by stephen (Priest) on Apr 15, 2002 at 20:33 UTC
I use mailscanner on all of my incoming mail. Mailscanner uses Mail::SpamAssassin, which does precisely the sort of thing you're looking for, and supplements it with things like RBL and Vipul's Razor. (These are both ways of detecting spam by checking against known spam-senders and known spams out there.) Mailscanner is written in Perl, and is very actively supported. It also will scan your incoming e-mail for viruses, which is a big help.

I've been running this setup for about three months, and I've found it works very well. I use fetchmail to download my mail to my internal mailserver (an old Pentium Pro machine), then scan everything with mailscanner. (I also pipe my outgoing mail through mailscanner... if I get a virus through some other means, I will infect no one else.) I use procmail to sort my mail into different folders, including a Spam directory for all autodetected spam. Then I access this from my desktop machine over IMAP.

As for the best way to run a firewall and mail services on the same machine-- the various Linux howtos I read indicated that this was a Bad Thing To Do. I'm not an expert on the subject, though. Is there a LinuxMonks out there?

stephen
Re: Email filtering by hossman (Prior) on Apr 16, 2002 at 06:47 UTC
Even if 99% of why you are looking into this is for the challenge of writing code that can identify spam, don't start from scratch. Become a SpamAssassin Contributor and write new spam tests that fit into the architecture -- that way lots of people can take advantage of your efforts WHILE using SpamAssassin.
Re: Email filtering by scottstef (Curate) on Apr 15, 2002 at 21:22 UTC
Just a word of warning before you go and automate using RBLs and the like. In theory they are great; however, they do have their flaws. I have seen mail servers get blacklisted because their configuration made them look like an open relay when in fact they were not. The RBLs just run a quick scan to see if a server will accept anonymous connections. They do not check what happens to that connection after it is accepted (whether the mail is thrown out or not). This may cause you to miss some important emails from people who are incorrectly listed by RBLs. I would suggest filtering purely on content (such as see britney..., get your degree... and the like) rather than on a black hole list, which can incorrectly filter out email from servers that do not meet the list's specifications.

"The social dynamics of the net are a direct consequence of the fact that nobody has yet developed a Remote Strangulation Protocol." -- Larry Wall
Re: Re: Email filtering by John M. Dlugosz (Monsignor) on Apr 25, 2002 at 21:56 UTC
I plan on using a write-back technique where the robot sends mail back to the sender, asking them to validate the address by simply replying or clicking a URL. So if someone who happens to have an ISP that's blacklisted by the ORB writes me a message entitled "URGENT FIX !!!" about using my shareware library to calculate low home mortgages, the message still won't fall into a Black Hole. It will just be delayed until the sender has shown that they have a proper return address and are a real person, not a bulk mailer.
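
A minimal sketch of the bookkeeping such a write-back scheme might need, assuming an in-memory queue. The names `hold_message` and `confirm_token`, the two hashes, and the token format are hypothetical; actually sending the challenge mail and persisting the queue are left out.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my %pending;      # token => { sender, message } for mail awaiting validation
my %validated;    # sender address => 1 once the sender has confirmed

# Hold a message from an unknown sender. Returns the token to put in the
# challenge mail, or nothing if the sender is already validated.
sub hold_message {
    my ($sender, $message) = @_;
    my $addr = lc $sender;
    return if $validated{$addr};
    # Assumption: time + rand is good enough for a sketch; a real system
    # would want an unguessable token from a proper random source.
    my $token = sha1_hex(join '|', $addr, time, rand);
    $pending{$token} = { sender => $addr, message => $message };
    return $token;
}

# Called when the sender replies or clicks the URL carrying the token.
# Releases the held message and remembers the sender as validated.
sub confirm_token {
    my ($token) = @_;
    my $held = delete $pending{$token} or return;
    $validated{ $held->{sender} } = 1;
    return $held->{message};
}
```

Each token works only once, and later mail from a confirmed address skips the challenge entirely.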
Re: Email filtering by moodster (Hermit) on Apr 16, 2002 at 09:40 UTC
A word of advice: running your own web server at home is pretty trivial, but running a mail server is not. For a web server running your home page and maybe a few neat perl applications, uptime isn't really crucial, but for a mail server it is. If your box goes down while you're on vacation for two weeks, then you'll lose two weeks' worth of email. Also, configuring a mail server like sendmail is far from trivial... (been there, tried that :)

stephen's approach (above) of setting up an internal mail server and using fetchmail to pull mail from a POP account is a lot easier to set up. Assuming you're using a decent ISP it will also be a hell of a lot more reliable.

No one has mentioned it yet, but Mail::Audit is pretty cool for user-level mail filters. It's a procmail replacement which lets you write perl programs to sort and/or reject mails. Nice, because that means I will never have to use procmail ever again.

See also: A Beginner's Guide to Using Mail::Audit and Mail::SpamAssassin

Cheers,
--Moodster
Re: Re: Email filtering by John M. Dlugosz (Monsignor) on Apr 25, 2002 at 22:00 UTC
A friend of mine uses DNS records such that if his mail server is down, it gets delivered to his ISP, which then tries to pass it along when it can. I thought that's the normal way of doing it.
Re: Email filtering by talexb (Chancellor) on Apr 16, 2002 at 13:50 UTC
The way I'd set up an E-Mail program would be to start with a list of E-Mail addresses that I trust. Each address would be associated with a folder, so as mail arrived from a known address, it would go into the associated folder. Mail that wasn't recognized would go into a Junk folder, perhaps for examination by some spam tool. If the message is from someone I do know, I rescue their message, add them to my address book and store their message in the associated folder. That's what my approach would be, because the ratio of spam messages to messages from people I've never before written to is very high. Perhaps other people have had different experiences.

--t. alex
"Nyahhh (munch, munch) What's up, Doc?" --Bugs Bunny
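
The address-to-folder idea could look something like this in plain Perl. It is a sketch only: the `%folder_for` address book, the sample addresses, and the folder names are made up for illustration, and the From: parsing is deliberately naive.

```perl
use strict;
use warnings;

# Hypothetical address book: trusted sender => folder name.
my %folder_for = (
    'alice@example.com' => 'Friends',
    'boss@example.org'  => 'Work',
);

# Pull the bare address out of a From: value such as
# 'Alice <alice@example.com>'. Naive on purpose; real headers are messier.
sub sender_address {
    my ($from) = @_;
    my ($addr) = $from =~ /([^\s<>"]+\@[^\s<>"]+)/;
    return defined $addr ? lc $addr : '';
}

# Known senders go to their folder; everything else goes to Junk.
sub folder_for_message {
    my ($from) = @_;
    my $addr = sender_address($from);
    return exists $folder_for{$addr} ? $folder_for{$addr} : 'Junk';
}

# Rescuing a message from Junk: trust the sender from now on.
sub rescue_sender {
    my ($from, $folder) = @_;
    $folder_for{ sender_address($from) } = $folder;
    return;
}
```

After `rescue_sender('Carol <carol@example.net>', 'Friends')`, later mail from that address would be filed under Friends instead of Junk.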
Re: Re: Email filtering by hossman (Prior) on Apr 16, 2002 at 17:32 UTC
This is easily achieved using SpamAssassin. (You would also need something like Mail::Audit or procmail to do the folders-by-sender part.)

SpamAssassin is built around the notion of "tests" which are applied to each msg. Each test has an associated point value, and if the total point value for a particular msg is above a set threshold, the msg is considered spam. Which tests you use, what point value each test gets, and what your personal threshold is are all totally configurable. (Aside: SpamAssassin marks up your msgs to indicate which are spam by adding mail headers listing which tests matched and what its total score was, in addition to putting "spam" in the subject. All of which makes it easy to set up automatic filtering. Again, this is all configurable.)

One of the tests is called "USER_IN_WHITELIST" with a default score of "-100" (ie: if the user is in your whitelist, they have to do a LOT of bad shit in their emails to be considered spam). If you configure SpamAssassin to only use the USER_IN_WHITELIST test, and set your personal threshold at -99, email from anyone you don't know will get flagged as spam (and you can have Mail::Audit or procmail file it into a spam folder for you.)

The only step left is to periodically check your spam folder for msgs from people who are "ok" but have never been added to your whitelist -- that's easy enough. The -W option of spamassassin will parse a msg for addresses and add them all to your whitelist, or if your email program allows you to hook into perl methods, the Mail::SpamAssassin API has methods like `add_all_addresses_to_whitelist`.
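
To make the arithmetic concrete, here is a toy version of the score-and-threshold idea in plain Perl. It does not use the Mail::SpamAssassin API at all; the `%whitelist`, the `@tests` table, and the function names are stand-ins I made up to show why a lone -100 whitelist test plus a threshold of -99 flags everyone you don't know.

```perl
use strict;
use warnings;

# Hypothetical whitelist of trusted senders.
my %whitelist = map { $_ => 1 } ('alice@example.com');

# Each test has a name, a score, and a check run against the message.
# Only the whitelist test is enabled, mirroring the setup described above.
my @tests = (
    {
        name  => 'USER_IN_WHITELIST',
        score => -100,
        check => sub { $whitelist{ lc $_[0]{from} } },
    },
);

# Sum the scores of every test that matches; also return the test names.
sub score_message {
    my ($msg) = @_;
    my ($total, @hits) = (0);
    for my $test (@tests) {
        next unless $test->{check}->($msg);
        $total += $test->{score};
        push @hits, $test->{name};
    }
    return ($total, @hits);
}

# A message counts as spam when its total reaches the threshold.
sub is_spam {
    my ($msg, $threshold) = @_;
    my ($total) = score_message($msg);
    return $total >= $threshold ? 1 : 0;
}
```

A whitelisted sender scores -100, which is below -99, so the mail passes; an unknown sender scores 0, which is above -99, so the mail is flagged.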