in reply to Filtering SPAM hot words from Message Post

Three of the most common types of spam I get are

(1) attempts to get my personal details by telling me there are millions of dollars waiting for me

(2) text in a randomly-generated-words style

(3) mail-order (and in your case other) advertisements

The first category always comes wrapped in a complicated story, but at least it can be identified by the predictable list of information being asked for somewhere in the spam. The second type can only be identified by an excess of grammatical errors, and the third by the kind of site it links to.

So I would specify the following functionality:

- for type 1: fuzzy matching on lists of personal-detail placeholders (e.g. "name", "address", \w*phone, etc.); see perlre. As far as I know, this has to be hand-rolled.

- for type 2: Lingua::EN is a namespace containing several modules that can parse English grammar in different ways. If a message fails 30 checks out of 30, or gets some similarly poor score, you have detected type (2) above. (Update: I took another look at one of these spams. They have recently become grammatically clever where they used to be random words, so a message that scores as high as 95% grammatically correct might now have to be deemed spam!)

- for type 3: you are generally looking for links to sites that use spam to advertise. SpamAssassin can help maintain a list of blocked URLs, but some manual work will be needed. There is no way to do it just by looking for keywords like "free", which might be used in a perfectly non-spam post, e.g. "free up resources" might appear a lot on this site.
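
As a rough illustration of the type 1 and type 3 checks, here is a minimal sketch in Python (the regexes carry over to Perl almost unchanged). Everything named here is an assumption of mine for illustration: the pattern list, the threshold of 3, the helper names and the example blocklist are all hypothetical and would need tuning against real spam. It does not use SpamAssassin; the blocklist is just a set you would have to maintain yourself.

```python
import re
from urllib.parse import urlparse

# Hypothetical placeholder patterns for type 1 (requests for personal details).
DETAIL_PATTERNS = [
    r"\b(?:full\s+)?name\b",
    r"\baddress\b",
    r"\b\w*phone\b",          # phone, telephone, cellphone, ...
    r"\bage\b",
    r"\boccupation\b",
    r"\bbank\s+account\b",
]
DETAIL_RES = [re.compile(p, re.IGNORECASE) for p in DETAIL_PATTERNS]

# Crude URL matcher: scheme followed by anything up to whitespace or a quote.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def count_detail_requests(text):
    """Count how many distinct personal-detail patterns occur in text."""
    return sum(1 for rx in DETAIL_RES if rx.search(text))


def looks_like_detail_phish(text, threshold=3):
    """Flag text that asks for at least `threshold` kinds of personal detail."""
    return count_detail_requests(text) >= threshold


def linked_hosts(text):
    """Return the set of lower-cased host names linked from text."""
    hosts = set()
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;:!?)")   # drop trailing sentence punctuation
        try:
            host = urlparse(url).hostname
        except ValueError:            # malformed URL, skip it
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def links_to_blocked_site(text, blocked_hosts):
    """True if any linked host is a blocked host or a subdomain of one."""
    for host in linked_hosts(text):
        if any(host == b or host.endswith("." + b) for b in blocked_hosts):
            return True
    return False
```

Requiring several different detail patterns in one message, rather than any single word, is what keeps an innocent post that merely mentions an "address" from being flagged. The host check deliberately ignores keywords such as "free" and looks only at where the links point.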

And for any other categories of spam, you'd have to do what I did above: identify some functional test that can't produce false positives, and devise or seek a solution on that basis. If any spam does not fall into these categories and you can't figure it out, post it specifically as an example.

__________________________________________________________________________________

Free your mind!
