in reply to Spam filtering regexp - keyword countermeasure countermeasure

If you want a general algorithm that will handle all kinds of possibilities, then you want something that is ignorant of the actual content and instead does statistical analysis on what it "sees".

In other words, you want Bayesian analysis done over a structure similar to (if not exactly) a Markov matrix.

But if you are going to do that - be warned that you are repeating the work of the very successful SpamAssassin.
(I've used SpamAssassin for a while now and it kicks ass for spam.)

If you are writing something just as a programming exercise, then look into that - if you are writing it to solve your spam problem - then I would first look into SpamAssassin.

-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

Re: Re: Spam filtering regexp - keyword countermeasure countermeasure
by CountZero (Bishop) on May 12, 2003 at 18:42 UTC

    POPFile of course does use Bayesian analysis, and over the few months I have been using it, it has caught about 98% of the spam. The ones it didn't catch were mostly short messages (probably too few words to analyse).

    And of course if you start calling Viagra "Sildenafil Citrate", even the best analyst gets confused (once).

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Re: Spam filtering regexp - keyword countermeasure countermeasure
by John M. Dlugosz (Monsignor) on May 12, 2003 at 19:10 UTC
    I am doing Bayesian analysis, via POPFile. However, that works with words, so any non-word property I can detect by other means I can add as a "keyword" to be considered by the Bayesian analysis.
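
    To make the "keyword" idea concrete, here is a minimal sketch in Python. It is not POPFile's interface; the token names (xx_split_letters and so on), the three patterns, and the function names are all hypothetical, chosen only to show how a non-word property could be turned into something a word-based filter can count.

```python
import re

def pseudo_keywords(message):
    """Return synthetic tokens describing non-word properties of a message.

    The token names and patterns are illustrative assumptions only.
    """
    tokens = []
    # Letters split up by punctuation, e.g. "V.i.a.g.r.a" or "v-i-a-g-r-a".
    if re.search(r"\b(?:[A-Za-z][.\-_*]){3,}[A-Za-z]\b", message):
        tokens.append("xx_split_letters")
    # Digits or symbols sitting inside a run of letters, e.g. "V1agra".
    if re.search(r"[A-Za-z]+[0-9@$!]+[A-Za-z]+", message):
        tokens.append("xx_leet_word")
    # A long unbroken run of capital letters, e.g. "GUARANTEED".
    if re.search(r"[A-Z]{8,}", message):
        tokens.append("xx_shouting")
    return tokens


def augment(message):
    """Append the pseudo-keywords so a word-based filter sees them as words."""
    extra = pseudo_keywords(message)
    return message + (" " + " ".join(extra) if extra else "")
```

    The augmented text would then go to the word-based classifier in place of the raw message, so the made-up tokens get weighted like any other word.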
      Sorry, I guess I wasn't specific enough.
      What I meant was to do the Bayesian analysis in a way that is totally independent of the language the text is written in.

      You don't want it to be at all aware of the words that it is looking at. Instead, you want to look at the statistical frequency of the subsections that make up the text. (Technically you could also use sections larger than words, as long as they include whitespace and other characters. You just don't want to use only words.)

      For instance, trigraphs (sequences of three characters) usually perform well in that respect. You could even break it down to the character level if you want, but that will slow it down considerably.

      To gain the real benefits of Bayesian analysis, you don't want it to be aware of any words at all - that defeats the purpose - or at least doesn't play to its strength.

      I would try playing with it at different levels: bigraphs and trigraphs are going to perform well, but will be slower; looking at five characters at a time might prove to work well. It would all have to be tested.
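
      As a rough illustration of what "different levels" means, the Python sketch below splits a string into overlapping runs of 2, 3 and 5 characters and counts them. The function name char_ngrams and the sample text are made up for the example; how each size actually performs on real mail would still have to be measured.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every overlapping run of n characters, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical sample: note the double space and the line break are kept.
sample = "buy  now\nbuy cheap"

for n in (2, 3, 5):
    counts = Counter(char_ngrams(sample, n))
    # Print the subsection size, how many distinct subsections it produced,
    # and the most frequent ones.
    print(n, len(counts), counts.most_common(3))
```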

      So you would break a phrase up into subsections, dump those into your structure (usually a Markov matrix in the end), and then calculate the weights on it.
      Then you train it on good and bad mail, so the structure learns what the weights look like for each.
      Then, as new mail is compared against that structure, you see what weight it comes away with and sort the mail accordingly.

      Do note that when you are doing the character analysis, you count every character, including spaces (even multiples in a row) and line breaks.
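
      The learn-then-score steps above can be sketched roughly as follows. To keep it short, this uses plain naive Bayes with add-one smoothing over trigraph counts and not a real Markov matrix, so treat it as a simplified stand-in. The class name NgramFilter, the "spam"/"ham" labels and the training strings are all made up for the example, and this is not how SpamAssassin or POPFile is implemented.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Every character counts: spaces (even repeated ones) and line breaks too.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramFilter:
    """Naive Bayes over character n-grams (a sketch, not a Markov matrix)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Add one training message; label must be "spam" or "ham"."""
        self.counts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def score(self, text):
        """Log-odds that text is spam: positive leans spam, negative leans ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.counts.items()}
        # Prior from the number of training messages, smoothed by one each.
        logodds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for gram in char_ngrams(text, self.n):
            # Add-one smoothing, with one extra slot for unseen n-grams.
            p_spam = (self.counts["spam"][gram] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][gram] + 1) / (totals["ham"] + vocab + 1)
            logodds += math.log(p_spam / p_ham)
        return logodds

    def is_spam(self, text):
        return self.score(text) > 0

# Tiny made-up training set, just to show the calls.
f = NgramFilter()
f.learn("Buy cheap pills now!!! Click here", "spam")
f.learn("Cheap pills, buy now, click here!!!", "spam")
f.learn("Meeting moved to Tuesday, see agenda attached", "ham")
f.learn("Can you review my patch before the meeting?", "ham")
print(f.is_spam("buy pills now"), f.is_spam("agenda for the meeting"))
```

      With only four training messages the numbers mean very little; the point is only the shape of it: count subsections per class, then compare a new message's subsections against both sets of counts.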

      In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
      I have used it and went from getting well over 100 spam a day down to never getting spam anymore. (Well, I get them, but they get filtered out and I never see them.)

      -------------------------------------------------------------------
      There are some odd things afoot now, in the Villa Straylight.
        So, SpamAssassin works on characters, not words?

        Maybe I'll try that as an alternative to POPFile, if it runs locally and on Windows.

        —John