Three of the most common types of spam I get are:

(1) attempts to get my personal details by telling me there are millions of dollars waiting for me

(2) messages made up of randomly generated words

(3) mail-order (and, in your case, other) advertisements

The first category always comes wrapped in a complicated story, but it can at least be identified by the list of personal information it asks for somewhere in the message. The second type can only be identified by an excess of grammatical errors, and the third by the kind of site it links to.

So I would specify the following functionality:

- for type 1: fuzzy matching on lists of personal-detail placeholders (e.g. "name, address, \w*phone, etc."); see perlre. As far as I know, this has to be hand-rolled.

- for type 2: Lingua::EN is a namespace containing several modules that can parse English grammar in different ways. A score such as 30 failures out of 30, or something similarly poor, means you have detected type (2) above. (Update: I took another look at such a spam, and these have recently become grammatically clever where they used to be random words, so even a message scoring as high as 95% grammatically correct might now need to be deemed spam!)

- for type 3: you are generally looking for links to sites that advertise through spam. SpamAssassin can help maintain a list of blocked URLs, but some manual work will be needed. There is no way to do it just by looking for keywords like "free", which might be used in a perfectly legitimate post; e.g. "free up resources" might appear a lot on this site.
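As a rough illustration of the type 1 and type 3 checks, here is a minimal hand-rolled sketch. Everything named in it is an assumption of mine: the pattern list, the threshold of 3, the sub names and the two blocked hosts are placeholders, and a real list would have to be built up by hand from the spam you actually receive.

```perl
use strict;
use warnings;

# Hypothetical list of personal-detail placeholders (type 1).
# Extend by hand as new spam arrives.
my @DETAIL_PATTERNS = (
    qr/\b(?:full\s+)?name\b/i,
    qr/\baddress\b/i,
    qr/\b\w*phone\b/i,                      # phone, telephone, cellphone ...
    qr/\b(?:age|date\s+of\s+birth)\b/i,
    qr/\boccupation\b/i,
    qr/\bbank\s+(?:account|details)\b/i,
);

# Count how many distinct kinds of personal detail the text mentions.
sub count_detail_requests {
    my ($text) = @_;
    return scalar grep { $text =~ $_ } @DETAIL_PATTERNS;
}

# Type 1 heuristic: flag text that mentions at least $threshold
# distinct kinds of detail (the default of 3 is an arbitrary assumption).
sub looks_like_detail_harvest {
    my ($text, $threshold) = @_;
    $threshold = 3 unless defined $threshold;
    return count_detail_requests($text) >= $threshold ? 1 : 0;
}

# Hypothetical hand-maintained blocklist of hosts (type 3).
my %BLOCKED_HOST = map { $_ => 1 } qw(cheap-pills.example mailorder.example);

# Return every blocked host that the text links to.
sub blocked_links {
    my ($text) = @_;
    my @hits;
    while ($text =~ m{https?://([^/\s"'<>:]+)}gi) {
        my $host = lc $1;
        $host =~ s/^www\.//;                # treat www.foo and foo alike
        push @hits, $host if $BLOCKED_HOST{$host};
    }
    return @hits;
}
```

A post that merely mentions "name" once stays under the threshold, which is the point of counting distinct details rather than matching any single word.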

And for any other category of spam, you'd have to do what I did above: identify some functional test that can't produce false positives, and devise or seek a solution on that basis. If some spam does not fall into these categories and you can't figure it out, post it as a specific example.
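One way to check whether a candidate test "can't false-positive" is to run it over posts you already know to be legitimate. The sketch below assumes each test is a code ref returning true for spam; the sub name, the two candidate tests and the sample posts are all made up for illustration.

```perl
use strict;
use warnings;

# Return the known-good posts that a candidate test wrongly flags as spam.
sub false_positives {
    my ($test, @good_posts) = @_;
    return grep { $test->($_) } @good_posts;
}

# Hypothetical candidate tests: a naive keyword match and a narrower phrase match.
my $naive_keyword = sub { $_[0] =~ /\bfree\b/i };
my $money_offer   = sub { $_[0] =~ /\bmillions?\s+of\s+dollars\b/i };

# Hypothetical sample of legitimate posts from the site.
my @good = (
    'Call undef on the hash to free up resources.',
    'How do I sort an array of hashes by two keys?',
);

for my $case (['naive keyword', $naive_keyword], ['money offer', $money_offer]) {
    my ($name, $test) = @$case;
    my @fp = false_positives($test, @good);
    printf "%s: %d false positive(s)\n", $name, scalar @fp;
}
```

Here the naive "free" test trips on the first sample post while the narrower phrase test does not. A clean run on a small sample proves nothing by itself, so the sample of good posts should be as large as you can manage.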

__________________________________________________________________________________

^M Free your mind!


In reply to Re: Filtering SPAM hot words from Message Post by Moron
in thread Filtering SPAM hot words from Message Post by Anonymous Monk
