John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I'm contemplating writing a pre-filter to use on incoming email before it's processed by POPFile. POPFile and tools like it operate on words, so the email must be parsed into word tokens first.

Now you and I know that a Subject containing "D *E *B *T" is spam, but this clue is totally invisible to POPFile.

I'm thinking if a Perl program can detect things like this and put the original word into a header, then that word will be seen by POPFile, and the header itself is also a clue that things like this were detected.

I've long seen words separated out with some punctuation mark, and that's easy to write a simple regexp for. But this is the first time I've seen spaces and punctuation used at the same time.

Rather than staying one step ahead with constant updating, I'm hoping to figure out a general enough algorithm (probably not a simple regexp anymore) that can handle anything like that. False positives are OK. It should find groups of letters that are related as a "word" even though they are not contiguous in a file. The algorithm would handle ordinary acronyms (F.L.A.R.P.) as a special case.
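To make this concrete, here is the sort of starting point I have in mind - the {1,3} separator length and the minimum of three letters are arbitrary guesses on my part, not anything tuned against real spam:

```perl
use strict;
use warnings;

# Find runs of single letters separated by short bursts of spaces and/or
# punctuation, and rejoin them into one word.
sub collapse_split_words {
    my ($text) = @_;
    $text =~ s{\b(\w(?:[\s[:punct:]]{1,3}\w){2,})\b}
              { (my $w = $1) =~ s/\W+//g; $w }ge;
    return $text;
}

print collapse_split_words("D *E *B *T relief now"), "\n";  # DEBT relief now
print collapse_split_words("F.L.A.R.P. acronym"),    "\n";  # FLARP. acronym
```

The collapsed word could then go into an X-header for POPFile to tokenize, alongside a marker header noting that splitting was detected.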

Any ideas?

—John


Replies are listed 'Best First'.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by AssFace (Pilgrim) on May 12, 2003 at 16:12 UTC
    If you want a general algorithm that will handle all kinds of possibilities, then you want something that is ignorant of the actual content and instead does statistical analysis on what it "sees".

    In other words, you want Bayesian analysis done over a structure similar to (if not exactly) a Markov matrix.

    But if you are going to do that, be warned that you are repeating the work of the very successful SpamAssassin.
    (I've used SpamAssassin for a while now and it kicks ass for spam.)

    If you are writing something just as a programming exercise, then look into that - if you are writing it to solve your spam problem - then I would first look into SpamAssassin.

    -------------------------------------------------------------------
    There are some odd things afoot now, in the Villa Straylight.

      POPFile of course does use Bayesian analysis, and over the few months I have been using it, it has caught about 98% of the spam. The ones it didn't catch were mostly short messages (probably too few words to analyse).

      And of course if you start calling Viagra "Sildenafil Citrate", even the best analyst gets confused (once).

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      I am doing Bayesian analysis, via POPFile. However, that works with words, so any non-word property I can detect by other means I can add as a "keyword" to be considered by the Bayesian analysis.
        Sorry, I guess I wasn't specific enough.
        What I meant was to do Bayesian analysis on it so that it is totally independent of the language the text is in.

        You don't want it to be at all aware of the words it is looking at - instead you want to look at the statistical frequency of the sub-sections that make it up. (Although technically you could also use sections larger than words - as long as you include whitespace and punctuation characters - you just don't want to use only whole words.)

        For instance, trigraphs usually perform well in that respect. You could even break it down to the single-character level if you want, but that will slow it down considerably.

        To gain the real benefits of Bayesian analysis, you don't want it to be aware of any words at all - that defeats the purpose - or at least doesn't play to its strength.

        I would try playing with it at different levels - bigraphs and trigraphs are going to perform well but will be slower; looking at five characters at a time might also prove to work well - you would have to test it all out.

        So you would break a phrase up into the subsections, dump that into your structure (usually a Markov Matrix in the end) and then calculate the weights on it.
        Then you learn on good and bad mail and the structures learn how the weights work for that.
        Then as new mail is compared against that structure, you see what weight that it comes away with and it will then sort out the mail accordingly.

        Do note that when you are doing the character analysis, you count every character - including spaces (even multiple spaces in a row) and line breaks.
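        As a minimal sketch of that tokenizing step (plain Perl, a window of 3, function name mine):

```perl
use strict;
use warnings;

# Slide a 3-character window over the raw text, counting every trigraph.
# Spaces and line breaks count like any other character.
sub trigraph_counts {
    my ($text) = @_;
    my %count;
    $count{ substr $text, $_, 3 }++ for 0 .. length($text) - 3;
    return \%count;
}

my $c = trigraph_counts("V I A G R A");
print $c->{" I "}, "\n";   # the space-I-space trigraph occurs once
```

        Those counts are the raw material you would feed into the Markov/Bayes structure, trained separately over good and bad mail.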

        In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
        I have used it and went from getting well over 100 spams a day down to never seeing spam anymore. (Well, I still get them, but they get filtered out and I never see them.)

        -------------------------------------------------------------------
        There are some odd things afoot now, in the Villa Straylight.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by Popcorn Dave (Abbot) on May 12, 2003 at 16:40 UTC
    A couple of things I've been seeing a lot of lately are "leet" speak and HTML with lots of <!--Dave comments in it. Even filtering out the "leet" speak would probably solve a lot of your spam.

    Something simple like /[a-zA-Z]\d/ would be cause to flag the message as an offender, I would think. Or at the very least it's a start.

    Also - and someone will correct me if I'm mistaken here - I can't see any need for comments in an HTML e-mail message that contain my name. To me, any time I see HTML comments, it's suspect.
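    A rough sketch combining both heuristics - the scoring weights here are pulled out of thin air:

```perl
use strict;
use warnings;

# Count letter/digit mixing ("leet" speak like V1agra) and treat HTML
# comments in the body as strong evidence. Weights are arbitrary.
sub looks_suspicious {
    my ($mail) = @_;
    my $score  = 0;
    $score++ while $mail =~ /[a-zA-Z]\d|\d[a-zA-Z]/g;
    $score += 5 if $mail =~ /<!--/;
    return $score;
}

print looks_suspicious("Buy V1agra <!--Dave--> now"), "\n";   # 6
print looks_suspicious("Meeting at 2pm"), "\n";               # 1
```

    Note that legitimate strings like "2pm" trip the first rule, but the OP said false positives are acceptable.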

    Good luck!

    There is no emoticon for what I'm feeling now.

Re: Spam filtering regexp - keyword countermeasure countermeasure
by CountZero (Bishop) on May 12, 2003 at 19:09 UTC

    I am also using POPFile and I didn't notice that it failed to catch spam with "split" words. Most of the time it seems that there are sufficient other clues to categorize the e-mail as spam or not.

    If it ain't broken, don't repair it!

    However, as it is an interesting question, I would go along the following track:

    • First, delete all "funny" characters (*, #, _, ...).
    • Then collapse all whitespace between single alphabetic characters. (This might catch the occasional 'I' or 'a', but that should be quite rare, I think. If it happens too often, you could make an 'I' or 'a' at the beginning or end of a sequence an exception to the rule. Alas, 'V *I *A *G *R *A' would then become 'VIAGR' and 'A', and 'I *N *T *E *R *E *S *T R *A *T *E *S D *R *O *P' would make 'I' and 'NTEREST RATES DROP', but POPFile would quickly catch on to these as well.)
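    In Perl, that track might look something like this (the set of "funny" characters is only an example):

```perl
use strict;
use warnings;

# Step 1: delete the "funny" characters.
# Step 2: collapse the whitespace inside runs of single letters.
sub unsplit {
    my ($subject) = @_;
    $subject =~ s/[*#_.]+//g;
    $subject =~ s{\b(\w(?:\s+\w\b)+)}{ (my $w = $1) =~ s/\s+//g; $w }ge;
    return $subject;
}

print unsplit("V *I *A *G *R *A"), "\n";       # VIAGRA
print unsplit("D E B T consolidation"), "\n";  # DEBT consolidation
```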

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      To help avoid inadvertently collapsing a *legitimate* single character, you could use the fact that the "funny" separator is the same between the pieces that should be collapsed (until the spammers catch on to that, anyway).
      It will be a not-fun day when they start doing random grouping of chars in their headers, i.e.:
      viagra = V *IA *GR  **A
      But, ignoring that situation: get the separator - " *", for instance - and remove all occurrences of it. Just a thought.
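      As a sketch, the "same separator" constraint maps naturally onto a backreference (the {1,3} cap on separator length is my own guess):

```perl
use strict;
use warnings;

# Capture the separator after the first letter and require exactly the
# same separator between every following letter, then strip it out.
sub strip_repeated_separator {
    my ($text) = @_;
    $text =~ s{\b(\w(?:(\W{1,3})\w)(?:\2\w)+)\b}
              { (my $w = $1) =~ s/\Q$2\E//g; $w }ge;
    return $text;
}

print strip_repeated_separator("Say no to D *E *B *T today"), "\n";
# Say no to DEBT today
```

      As noted, this still misses random grouping like "V *IA *GR  **A", since each piece between separators must be a single letter.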

        Sorry to be the bearer of bad tidings, but I saw some spam a couple of days ago that compared its product to "vi ag r a".

        Hugo
      That sounds pretty good. Thanks for the pointer.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by pzbagel (Chaplain) on May 12, 2003 at 16:26 UTC

    First off, spam and spam filtering are an ever-escalating arms race. You should be so lucky as to stay one step ahead of them. Are you able to use something like SpamBayes? Bayesian filtering is quickly becoming the best way to deal with spam.

    If you cannot run these tools, or just plain insist on writing this script, perhaps a good tactic would be to remove all punctuation and spaces from the subject line and then check it against a list of spam-ish words (debt, enlarge, coed). However, this idea will not handle ordinary acronyms. Another tactic might be to take that same list of spam-ish words and insert a check for non-alpha characters between each letter:

    /d[^A-Za-z]*e[^A-Za-z]*b[^A-Za-z]*t/i

    But that's woefully inefficient.
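    Building the pattern once from the word list and precompiling it with qr// helps a little with that, e.g. (the word list is just an example):

```perl
use strict;
use warnings;

# Join each spammy word's letters with an "anything but letters" gap,
# then compile the alternation once.
my @spammy  = qw(debt enlarge coed);
my $gap     = '[^A-Za-z]*';
my $alt     = join '|', map { join $gap, split //, $_ } @spammy;
my $spam_re = qr/$alt/i;

for my $subject ('Reduce your D-E-B-T now', 'Weekly status report') {
    print "$subject: ", $subject =~ $spam_re ? "flagged" : "ok", "\n";
}
# Reduce your D-E-B-T now: flagged
# Weekly status report: ok
```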

    Just my 0.02

Re: Spam filtering regexp - keyword countermeasure countermeasure
by tachyon (Chancellor) on May 13, 2003 at 06:42 UTC

    This discussion of Bayesian spam filtering should be of use: http://www.paulgraham.com/spam.html

    We run a Bayesian web filter that uses phrases. A phrase is a 1-, 2- or 3-word token. The tokens are contiguous character sets (generally words), but also URL domain links and some other bits and bobs. We use phrases rather than single words as this adds context.

    Everything is automated; we don't add anything by hand. The system is designed to work hands-free. The data sets used to generate these were large (i.e. 50,000), and the DB (with all single instances removed) covers about half a million such token phrases. As you get more data, you simply re-run the phrase/probability generator, which will then take account of new phrases that are commonly appearing in your target content.

    We use Math::BigFloat because if you use too many phrases, your probabilities blow off the ends of the floating-point accuracy. If you don't use Math::BigFloat, the optimal range of tokens to run the Bayesian calculation on is (in our testing) 8-20, depending on the data set. We pick the phrases that offer the greatest differential (i.e. good-bad probability difference). Accuracy can be very impressive: we run a sensitivity of > 99.7% and a specificity of 99.9% for porn, for example.

    The probabilities tend to move rapidly towards either 0 or 1. A probability level > 0.9 works well in practice, but even > 0.5 is still remarkably accurate.
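    For what it's worth, here is one way such 1-3 word phrase tokens could be cut - tachyon's actual tokenizer surely differs, so treat the names and details as illustration only:

```perl
use strict;
use warnings;

# Emit every 1-, 2- and 3-word window over the text as a phrase token.
sub phrase_tokens {
    my ($text) = @_;
    my @words = split ' ', lc $text;
    my @tokens;
    for my $n (1 .. 3) {
        push @tokens, join ' ', @words[ $_ .. $_ + $n - 1 ]
            for 0 .. @words - $n;
    }
    return @tokens;
}

print join("; ", phrase_tokens("debt free offer")), "\n";
# debt; free; offer; debt free; free offer; debt free offer
```

    Working in summed log-probabilities is another common way around the floating-point underflow that Math::BigFloat addresses here.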

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Hmm, one reply says to use individual chars, and you use groups of words. I can see how that would work, in that D,E,B and E,B,T are both 3-token groups that will be found.

      So I get the feeling that using Bayesian analysis on single whole words (e.g. POPFile) is the worst way to do it!

      My idea is to add more "context" than POPFile can glean by itself, by adding special keywords when the preliminary filter spots things.

        Combining tokens gives context. It allows you to differentiate to an extent between 'are you free tonight' and 'debt free' 'free widgets' etc.

        If you just run on 'free' as a single word, you lose sensitivity, as it is quite common. Using consecutive tokens is the way to go IMHO, and is the method employed in voice recognition (Dragon used to use 2 words for context and IBM 3, I believe).

        The price is paid in size and speed. To give you an idea, our single-word token file is ~100K, the two-word token file is ~10MB and the three-word token file is > 1GB. You will note a roughly two-orders-of-magnitude increase in size as you add length to the phrases.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Spam filtering regexp - keyword countermeasure countermeasure
by Abigail-II (Bishop) on May 12, 2003 at 20:06 UTC
    False positives are OK.

    In that case:

    sub is_spam { my $mail = shift; return 1; }

    Abigail