Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

i want to build a spam filter for my forum.

when people post topics, i don't want the obvious unnecessary posts. you know, posts that contain words like:
f word s word free p0rn! free!~! p enlargement! etc.
I also want to be able to perhaps add custom "hot words" to filter out myself.

so how would i carry out this job? is there perhaps a module built for this type of job? i'd prefer not to use a module, but if need be, then i will.

my logic goes as:

..read csv file, split hotwords into @hotwords
foreach my $d (@hotwords) {
    .....use an if statement with some sort of regex to see if the hotwords fit any pattern inside the forum post msg
}
Am I on the right track with this?
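
A minimal sketch of the loop described above, under a few assumptions: the CSV file holds one comma-separated line of hot words (represented here by the hypothetical `$csv_line` string rather than a real file), and a match only flags the post. The name `find_hotwords` is illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the CSV file of hot words.
my $csv_line = 'free p0rn, enlargement';

# Split the comma-separated list into individual hot words.
my @hotwords = split /\s*,\s*/, $csv_line;

# Return the hot words found in a message (case-insensitive substring match).
sub find_hotwords {
    my ($msg) = @_;
    my @found;
    foreach my $hotword (@hotwords) {
        # \Q...\E escapes regex metacharacters inside the hot word.
        push @found, $hotword if $msg =~ /\Q$hotword\E/i;
    }
    return @found;
}

my @hits = find_hotwords('Get FREE P0RN here!');
print "Flag for moderators: @hits\n" if @hits;
```

This is a plain substring match, so it will also fire inside longer words and will miss respellings of a hot word.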

THANKS!

Re: Filtering SPAM hot words from Message Post
by shigetsu (Hermit) on May 05, 2007 at 23:40 UTC

    Judging from the documentation for Simon Cozens's SpamMonkey:

    SpamMonkey is a general purpose spam detection suite. It borrows heavily from SpamAssassin, 
    but it is designed to be used for plain text as well as email.

    it may suit your needs?

Re: Filtering SPAM hot words from Message Post
by ww (Archbishop) on May 06, 2007 at 00:10 UTC
    You've set yourself a large task and one that will require a large dictionary of forbidden words. First, I hope I'm clear on the first three lines of your logic: the csv file will contain the list of verboten words and $d will (sequentially) hold individual "hotwords" for processing (and you'd be better off calling that $hotwords or something clearer than a single-character variable name).

    But where's the user input? Ah, need another var, say, $in_msg.

    And then you want to compare the input with the forbidden words... so you're going to walk the text with another split (of $in_msg into, say, @suspect) and then compare each of your $hotwords with each word in @suspect? (Caveat: if you use eq rather than a regex, you'll need all the possible VaRIAtions OF cAsE. On the other hand, if you go with a regex (case insensitive, one presumes), please share the construct that matches all your possible prohibitions, as :-) I'm not familiar with that one.)
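
One possible shape for such a construct, offered as a sketch and not as a complete answer: escape every hot word, join them into a single alternation, and compile it once with `/i`. The word list, `$hotword_re`, and `is_suspect` are hypothetical names; `$in_msg` follows the naming suggested above.

```perl
use strict;
use warnings;

# Hypothetical hot-word list; in practice it would come from the CSV file.
my @hotwords = ('free p0rn', 'enlargement');

# Escape each entry, then join them into one alternation.
# Longest first, so a longer phrase is tried before a shorter one.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @hotwords;

# \b anchors the match on word boundaries; /i ignores case.
my $hotword_re = qr/\b(?:$alternation)\b/i;

# Return 1 if the message contains any hot word as a whole word or phrase.
sub is_suspect {
    my ($in_msg) = @_;
    return $in_msg =~ $hotword_re ? 1 : 0;
}

print is_suspect('FrEe P0rN inside'), "\n";        # mixed case still matches
print is_suspect('enlargements of photos'), "\n";  # \b rejects the partial word
```

It handles case and partial-word hits only. It still matches nothing beyond the literal entries in the list, which is the limitation raised in the following paragraphs.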

    And now you have to decide what to do about the offensive words. Are you going to simply replace them with something that's less offensive -- say "XXXX"? Or are you going to send the whole post to the bitbucket? Or something else? More thought required here.

    Also, suppose "free p0rn!" is in your hotwords file but the poster writes "free porn." Now we're dealing not only with enlarging your hotwords dictionary; we may also be having to deal with parsing natural language. And that's hard: suppose the poster is writing to condemn "free porn" (or any other kinds). Do you want that post banned?
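
A rough sketch of one way to narrow the "p0rn" versus "porn" gap: fold both the hot words and the post to a canonical form before comparing. The substitution table in `normalize` is an illustrative assumption and far from complete, and `matches_hotword` is a hypothetical name. It does nothing for the natural-language problem (a post condemning the thing still matches).

```perl
use strict;
use warnings;

# Fold a string to a canonical form: lowercase, a few common digit
# substitutions undone, punctuation dropped, whitespace collapsed.
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/013457/oieast/;   # 0->o 1->i 3->e 4->a 5->s 7->t
    $text =~ s/[^a-z\s]+//g;      # drop everything but letters and whitespace
    $text =~ s/\s+/ /g;           # collapse runs of whitespace
    $text =~ s/^ | $//g;          # trim a leading or trailing space
    return $text;
}

# Normalize the hot words once, up front.
my @hotwords = map { normalize($_) } ('free p0rn!');

# Return 1 if the normalized message contains any normalized hot word.
sub matches_hotword {
    my ($msg) = @_;
    my $clean = normalize($msg);
    foreach my $hotword (@hotwords) {
        return 1 if index($clean, $hotword) >= 0;
    }
    return 0;
}

print "flagged\n" if matches_hotword('Free porn.');
```

Because digits are rewritten or dropped, the normalized text is only suitable for matching, not for display.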

    Or, just one more before the examples get tedious: Suppose your poster writes "Free Guppies" or some other captive group? What now?

    Perhaps you've thought this through before; but if not, it may well be worth your time and trouble to reconsider your distaste for using a module.

      i wasn't really trying to write any code, rather just allow someone to see the logic. i'm also sorry that I did not mention that if somehow "SPAM" is detected, it will be posted, but then "FLAGGED" so moderators can be quicker to it. but anyways, good module. did not spot that while searching last night. i'm grrrrrrrrrrrrrrrr'ing like a bear
Re: Filtering SPAM hot words from Message Post
by naikonta (Curate) on May 06, 2007 at 01:16 UTC
    You may want to check Regexp::Common, especially the profanity pattern.

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re: Filtering SPAM hot words from Message Post
by Moron (Curate) on May 07, 2007 at 09:51 UTC
    Three of the most common types of spam I get are

    (1) attempts to get my personal details by telling me there are millions of dollars waiting for me

    (2) text made up of randomly generated words

    (3) mailorder (and in your case other) advertisements

    The first category is always a complicated story but at least it can be identified by the expected list of information being asked for in some section of the spam. The second type can only be identified by excess of grammatical error and the third by the kind of site it links to.

    So I would spec. the following functionality:

    - for type 1: fuzzy matching on lists of personal details placeholders (e.g. "name, address, \w*phone, etc.") - see perlre - i.e. has to be hand-rolled as far as I know.

    - for type 2: Lingua::EN is a namespace containing several modules that can parse English grammar in different ways - 30/30 fails or some such poor score and you have detected type (2) above. (Update: I took another look at such a spam and these are recently getting grammatically clever where they used to be random words, so a score of as high as 95% grammatically correct might now need to be deemed a spam!!)

    - for type 3, you are generally looking for links to URLs that use spam to advertise. SpamAssassin can help maintain a list of blocked URLs but some manual work will be needed - there is no way to do it just by looking for keywords like "free" which might be used in a perfectly non-spam post, e.g. "free up resources" might appear a lot on this site.
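
A hand-rolled sketch of the type 1 and type 3 checks, using only core Perl. Everything specific here is an assumption for illustration: the detail patterns, the threshold of three, the placeholder host names, and all sub names. The type 1 check in particular can false-positive on an innocent post that happens to mention several of these words.

```perl
use strict;
use warnings;

# --- Type 1: count the kinds of personal details a post asks for ---
my %detail_re = (
    name    => qr/\b(?:full\s+)?name\b/i,
    address => qr/\baddress\b/i,
    phone   => qr/\b\w*phone\b/i,
    bank    => qr/\bbank\s+account\b/i,
);
my $threshold = 3;    # assumed cut-off, to be tuned by hand

sub requested_details {
    my ($msg) = @_;
    return sort { $a cmp $b } grep { $msg =~ $detail_re{$_} } keys %detail_re;
}

sub looks_like_detail_harvest {
    my ($msg) = @_;
    my @asked = requested_details($msg);
    return @asked >= $threshold ? 1 : 0;
}

# --- Type 3: check the hosts of posted links against a blocklist ---
my %blocked_host = map { $_ => 1 } qw(spam.example pills.example);

# Pull the host part out of every http/https link in the text.
sub link_hosts {
    my ($msg) = @_;
    my @hosts;
    while ($msg =~ m{\bhttps?://([^/\s:?#]+)}gi) {
        push @hosts, lc $1;
    }
    return @hosts;
}

# A host is blocked if it, or any parent domain, is on the list.
sub is_blocked_host {
    my ($host) = @_;
    my @labels = split /\./, $host;
    while (@labels) {
        return 1 if $blocked_host{ join '.', @labels };
        shift @labels;
    }
    return 0;
}

sub has_blocked_link {
    my ($msg) = @_;
    return (grep { is_blocked_host($_) } link_hosts($msg)) ? 1 : 0;
}

my $post = 'Send your full name, address and telephone number via http://www.spam.example/claim';
print "type 1 suspect\n" if looks_like_detail_harvest($post);
print "type 3 suspect\n" if has_blocked_link($post);
```

Keying the blocklist on hosts instead of keywords means a post such as "free up resources" passes unless it links somewhere on the list.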

    And for any other categories of spam, you'd have to do what I did above: identify some test functionality that can't false-positive and devise or seek a solution on that basis. If any spam does not fall into these categories and you can't figure it out, post it specifically as an example.

    __________________________________________________________________________________

    ^M Free your mind!