Sorry, I guess I wasn't specific enough.
What I meant was to do Bayesian analysis on it so that it is totally independent of the language that the text is in.

You don't want it to be at all aware of the words that it is looking at. Instead, you want to look at the statistical frequency of the subsections that make up the text. (Technically you could also use sections larger than words, as long as they include whitespace and the other characters; you just don't want to use only words.)

For instance, trigraphs (three-character sequences) usually perform well in that respect. You could even break it down to the character level if you want, but that will slow it down considerably.

To gain the real benefits of Bayesian analysis, you don't want it to be aware of any words at all - that defeats the purpose - or at least doesn't play to its strength.

I would try playing with it at different levels. Bigraphs and trigraphs are going to perform well, but will be slower; looking at five characters at a time might prove to work well. You would have to test it all out.
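Here is a rough sketch of what "playing with different levels" could look like, in Python just because it is short. The function name `char_ngrams` and the sample string are made up for illustration; nothing here comes from a particular library.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every overlapping run of n characters, whitespace included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "buy now, buy more"  # made-up sample text

# Try bigraphs, trigraphs and five-character windows on the same input
for n in (2, 3, 5):
    counts = Counter(char_ngrams(sample, n))
    print(n, len(counts), counts.most_common(2))
```

Shorter windows repeat more often but say less about context; longer windows are more specific but rarely repeat. Which size wins is something you would have to measure on your own mail.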

So you would break a phrase up into its subsections, dump those into your structure (usually a Markov matrix in the end), and then calculate the weights on it.
Then you train on good and bad mail, and the structure learns which weights go with each kind.
Then, as new mail is compared against that structure, you see what weight it comes away with, and you sort the mail accordingly.
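A minimal sketch of that train-then-score loop, in Python. To keep it short it uses plain per-class count tables with add-one smoothing (a naive Bayes setup) rather than a real Markov matrix, so treat it as an illustration of the idea and not a tuned filter. The class name `NgramBayes`, the labels and the training strings are all hypothetical.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Overlapping character n-grams; spaces and line breaks are kept."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramBayes:
    def __init__(self, n=3):
        self.n = n
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one known-good ("ham") or known-bad ("spam") message."""
        self.counts[label].update(ngrams(text, self.n))
        self.docs[label] += 1

    def score(self, text):
        """Log-odds that text is spam; above zero leans spam."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the smoothed prior, then add each n-gram's evidence
        logodds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for g in ngrams(text, self.n):
            p_spam = (self.counts["spam"][g] + 1) / (spam_total + vocab + 1)
            p_ham = (self.counts["ham"][g] + 1) / (ham_total + vocab + 1)
            logodds += math.log(p_spam / p_ham)
        return logodds

f = NgramBayes()
f.train("cheap pills buy now", "spam")
f.train("buy cheap pills online", "spam")
f.train("meeting notes attached", "ham")
f.train("see you at the meeting", "ham")
print(f.score("cheap pills"), f.score("meeting tomorrow"))
```

With this toy training set the first score comes out positive and the second negative. A real filter would need far more mail on both sides, and a threshold picked by testing.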

Do note that when you are doing the character analysis, you count every character, including spaces (even multiples in a row) and line breaks.
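To make that concrete, a tiny Python sketch (the helper name `char_counts` and the sample text are invented for the example). It counts raw characters and compares that with what a word split would keep.

```python
from collections import Counter

def char_counts(text):
    """Count every character exactly as it appears, whitespace included."""
    return Counter(text)

text = "FREE  money\n\nclick   here"  # made-up sample with runs of spaces
counts = char_counts(text)

print(counts[" "], counts["\n"])  # spaces and line breaks are counted too
print(text.split())               # a word split discards all of that
```

The runs of spaces and the blank line are part of the signal here, and a word-based tokenizer throws them away.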

In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
I have used it and went from getting well over 100 spam messages a day down to never seeing spam anymore. (Well, I still get them, but they get filtered out and I never see them.)

-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.

In reply to Re: Re: Re: Spam filtering regexp - keyword countermeasure countermeasure by AssFace
in thread Spam filtering regexp - keyword countermeasure countermeasure by John M. Dlugosz
