in reply to Spam filtering and regular expressions

Rather than relying solely on a regex, you may want to consider something that Thunderbird has an option for - disallowing e-mail from anyone not in an approved list of senders. I've got that option set and it has filtered down spam quite a bit.

HTH!

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.

Re^2: Spam filtering and regular expressions
by jhourcle (Prior) on Jul 30, 2005 at 19:30 UTC

    Whitelisting (only allowing e-mail from known good addresses) can reduce your spam significantly, but it doesn't deal with viruses, and it has a rather high rate of false positives: rejecting e-mail that you would have wanted to see, like a message from that high school friend you've lost track of, or from a friend telling you he's been fired from his job and had to switch e-mail addresses.

    The only advantage to acting on the e-mail addresses is that the address (well, the envelope sender, not necessarily what shows up in the 'From' header) is sent before the DATA command in SMTP, so you can reduce the bandwidth used by rejecting early. That only works for envelope-from and envelope-to, though, and I'm guessing that unless the system allows <> (the null e-mail address), you're going to be losing messages about delivery failures.
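
    A minimal sketch of that early check, in Python, run against the argument of MAIL FROM before any message data is transferred. The whitelist contents, the function name accept_envelope_sender and the allow_null_sender flag are made up for illustration; they are not taken from Thunderbird or from any particular mail server.

```python
# Hypothetical whitelist; the addresses are invented for this example.
ALLOWED_SENDERS = {"alice@example.org", "bob@example.net"}


def accept_envelope_sender(mail_from: str, allow_null_sender: bool = True) -> bool:
    """Decide at MAIL FROM time (before DATA) whether to accept the sender."""
    addr = mail_from.strip()
    # Strip the angle brackets used in the SMTP command, e.g. "<alice@example.org>".
    if addr.startswith("<") and addr.endswith(">"):
        addr = addr[1:-1]
    if addr == "":
        # Null sender "<>": used for bounces and delivery failure notices.
        return allow_null_sender
    # Simplification: compare case-insensitively, although the local part
    # of an address is technically case-sensitive.
    return addr.lower() in ALLOWED_SENDERS


print(accept_envelope_sender("<alice@example.org>"))
print(accept_envelope_sender("<>"))
print(accept_envelope_sender("<stranger@example.com>"))
```

    With allow_null_sender switched off, the same whitelist would also turn away every bounce, which is the trade-off mentioned above.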

    There are a wide variety of methods for attempting to determine whether a message is UCE (unsolicited commercial e-mail), but most of them tend either to catch only the obvious stuff or to be overly greedy and block legitimate mail. I agree that some regexes suck, but it takes many, many layers to do it well. If you're going to go the regex route, you might start by looking at the procmail rules from panix. I'd also recommend looking at spam-l and spam tools.

    I personally find that the best UCE indicator (i.e., no false positives, except maybe on spam discussion lists) is when something is obfuscated: octal in IP addresses, HTML whose displayed URLs don't match the actual link targets, JavaScript used to hide the content of the message, etc.
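
    Two of those obfuscation checks can be sketched in Python using only the standard library. This is an illustration under assumptions, not a tested filter: the names has_octal_ip, LinkCollector and mismatched_links are invented here, "octal" is approximated as a dotted-quad host with a leading zero in some part, and links are only compared when the visible text itself looks like a URL.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def has_octal_ip(url):
    """True if the host is a dotted quad with an octal-looking (leading-zero) part."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        return False
    return any(len(p) > 1 and p.startswith("0") for p in parts)


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return links whose visible text is a URL pointing at a different host."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    bad = []
    for href, text in parser.links:
        # Only compare when the visible text itself looks like a URL.
        if re.match(r"https?://", text, re.IGNORECASE):
            if urlparse(text).hostname != urlparse(href).hostname:
                bad.append((href, text))
    return bad


sample = '<a href="http://evil.example/login">http://bank.example/login</a>'
print(has_octal_ip("http://0300.0250.01.01/"))
print(mismatched_links(sample))
```

    Checks like these would be one layer among many; on their own they say nothing about mail that is spam but not obfuscated.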

Re^2: Spam filtering and regular expressions
by fraktalisman (Hermit) on Jul 30, 2005 at 19:23 UTC

    For most of my mail addresses, only allowing mail from senders on a whitelist would not be an option, because I do want to receive mail from people I do not already know. I regularly hand out the e-mail address of our skating crew, and of course we get loads of spam, but also very valuable messages. And since most of the existing spam filtering mechanisms offered by the popular web providers (like SpamAssassin) have already filtered out newsletters that I had subscribed to, I only use filtering with a very high threshold, so maybe I have to live with loads of spam in my inbox. Sometimes it is hard even for a human being to figure out whether a message is spam by looking at the title and sender, so how should an algorithm get this right in every case? I think it is impossible as a matter of principle.

      Both you and jhourcle are right. Relying solely on a whitelist is not the way to go. What I was suggesting was to include that as part of a solution.

      I know that with Thunderbird, even though it's using a whitelist, I go through the filtered mail every time I download e-mail and check whether something got filtered that shouldn't have been. The program's learning is part of the solution as well, though it may be beyond the scope of what the OP intends.