John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I'm contemplating writing a pre-filter to use on incoming email before it's processed by POPFile. POPFile and tools like it operate on words, so the email must be parsed into word tokens first.

Now you and I know that a Subject containing "D *E *B *T" is spam, but this clue is totally invisible to POPFile.

I'm thinking if a Perl program can detect things like this and put the original word into a header, then that word will be seen by POPFile, and the header itself is also a clue that things like this were detected.

I've long seen words separated out with some punctuation mark, and that's easy to write a simple regexp for. But this is the first time I've seen spaces and punctuation used at the same time.

Rather than staying one step ahead with constant updating, I'm hoping to figure out a general enough algorithm (probably not a simple regexp anymore) that can handle anything like that. False positives are OK. It should find groups of letters that are related as a "word" even though they are not contiguous in a file. The algorithm would handle ordinary acronyms (F.L.A.R.P.) as a special case.
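To make this concrete, here is the sort of starting point I have in mind - the {1,3} separator length and the minimum of three letters are arbitrary guesses on my part, not anything tuned against real spam:

```perl
use strict;
use warnings;

# Find runs of single letters separated by short bursts of spaces and/or
# punctuation, and rejoin them into one word.
sub collapse_split_words {
    my ($text) = @_;
    $text =~ s{\b(\w(?:[\s[:punct:]]{1,3}\w){2,})\b}
              { (my $w = $1) =~ s/\W+//g; $w }ge;
    return $text;
}

print collapse_split_words("D *E *B *T relief now"), "\n";  # DEBT relief now
print collapse_split_words("F.L.A.R.P. acronym"),    "\n";  # FLARP. acronym
```

The collapsed word could then go into an X-header for POPFile to tokenize, alongside a marker header noting that splitting was detected.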

Any ideas?

—John


Replies are listed 'Best First'.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by AssFace (Pilgrim) on May 12, 2003 at 16:12 UTC
    If you want a general algorithm that will handle all kinds of possibilities, then you want something that is ignorant of the actual content and instead does statistical analysis on what it "sees".

    In other words, you want Bayesian analysis done over a structure similar to (if not exactly) a Markov matrix.

    But if you are going to do that, be warned that you are repeating the work of the very successful SpamAssassin.
    (I've used SpamAssassin for a while now and it kicks ass for spam.)

    If you are writing something just as a programming exercise, then look into that - if you are writing it to solve your spam problem - then I would first look into SpamAssassin.

    -------------------------------------------------------------------
    There are some odd things afoot now, in the Villa Straylight.

      POPFile of course does use Bayesian analysis, and over the few months I have been using it, it has caught about 98% of the spam. The ones it didn't catch were mostly short messages (probably too few words to analyse).

      And of course if you start calling Viagra "Sildenafil Citrate", even the best analyst gets confused (once).

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      I am doing Bayesian analysis, via POPFile. However, that works with words, so any non-word property I can detect by other means I can add as a "keyword" to be considered by the Bayesian analysis.
        Sorry, I guess I wasn't specific enough.
        What I meant was to do Bayesian analysis on it so that it is totally independent of the language the text is in.

        You don't want it to be at all aware of the words it is looking at - instead you want to look at the statistical frequency of the sub-sections that make it up. (Although technically you could also use sections larger than words - as long as you include whitespace and punctuation characters - you just don't want to use only whole words.)

        For instance, trigraphs usually perform well in that respect. You could even break it down to the single-character level if you want, but that will slow it down considerably.

        To gain the real benefits of Bayesian analysis, you don't want it to be aware of any words at all - that defeats the purpose - or at least doesn't play to its strength.

        I would try playing with it at different levels - bigraphs and trigraphs are going to perform well but will be slower; looking at five characters at a time might also prove to work well - you would have to test it all out.

        So you would break a phrase up into the subsections, dump that into your structure (usually a Markov Matrix in the end) and then calculate the weights on it.
        Then you learn on good and bad mail and the structures learn how the weights work for that.
        Then as new mail is compared against that structure, you see what weight that it comes away with and it will then sort out the mail accordingly.

        Do note that when you are doing the character analysis, you count every character - including spaces (even multiple spaces in a row) and line breaks.
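        As a minimal sketch of that tokenizing step (plain Perl, a window of 3, function name mine):

```perl
use strict;
use warnings;

# Slide a 3-character window over the raw text, counting every trigraph.
# Spaces and line breaks count like any other character.
sub trigraph_counts {
    my ($text) = @_;
    my %count;
    $count{ substr $text, $_, 3 }++ for 0 .. length($text) - 3;
    return \%count;
}

my $c = trigraph_counts("V I A G R A");
print $c->{" I "}, "\n";   # the space-I-space trigraph occurs once
```

        Those counts are the raw material you would feed into the Markov/Bayes structure, trained separately over good and bad mail.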

        In the end, I'm not sure why you would want to do it on your own instead of just using SpamAssassin.
        I have used it and went from getting well over 100 spams a day down to never seeing spam anymore. (Well, I still get them, but they get filtered out and I never see them.)

        -------------------------------------------------------------------
        There are some odd things afoot now, in the Villa Straylight.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by Popcorn Dave (Abbot) on May 12, 2003 at 16:40 UTC
    A couple of things I've been seeing a lot of lately are "leet" speak and HTML with lots of <!--Dave comments in it. Even filtering out the "leet" speak would probably solve a lot of your spam.

    Something simple like /[a-zA-Z]\d/ would be cause to flag the message as an offender, I would think. Or at the very least it's a start.

    Also - and someone will correct me if I'm mistaken here - I can't see any need for comments in an HTML e-mail message that contain my name. To me, any time I see HTML comments, it's suspect.
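    A rough sketch combining both heuristics - the scoring weights here are pulled out of thin air:

```perl
use strict;
use warnings;

# Count letter/digit mixing ("leet" speak like V1agra) and treat HTML
# comments in the body as strong evidence. Weights are arbitrary.
sub looks_suspicious {
    my ($mail) = @_;
    my $score  = 0;
    $score++ while $mail =~ /[a-zA-Z]\d|\d[a-zA-Z]/g;
    $score += 5 if $mail =~ /<!--/;
    return $score;
}

print looks_suspicious("Buy V1agra <!--Dave--> now"), "\n";   # 6
print looks_suspicious("Meeting at 2pm"), "\n";               # 1
```

    Note that legitimate strings like "2pm" trip the first rule, but the OP said false positives are acceptable.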

    Good luck!

    There is no emoticon for what I'm feeling now.

Re: Spam filtering regexp - keyword countermeasure countermeasure
by CountZero (Bishop) on May 12, 2003 at 19:09 UTC

    I am also using POPFile and I didn't notice that it failed to catch spam with "split" words. Most of the time it seems that there are sufficient other clues to categorize the e-mail as spam or not.

    If it ain't broken, don't repair it!

    However, as it is an interesting question, I would go along the following track:

    • First, delete all "funny" characters (*, #, _, ...).
    • Then collapse all whitespace between single alphabetic characters. (This might catch the occasional 'I' or 'a', but that should be quite rare, I think. If it happens too often, you could make an 'I' or 'a' at the beginning or end of a sequence an exception to the rule. Alas, 'V *I *A *G *R *A' would then become 'VIAGR' and 'A', and 'I *N *T *E *R *E *S *T R *A *T *E *S D *R *O *P' would make 'I' and 'NTEREST RATES DROP', but POPFile would quickly catch on to these as well.)
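    In Perl, that track might look something like this (the set of "funny" characters is only an example):

```perl
use strict;
use warnings;

# Step 1: delete the "funny" characters.
# Step 2: collapse the whitespace inside runs of single letters.
sub unsplit {
    my ($subject) = @_;
    $subject =~ s/[*#_.]+//g;
    $subject =~ s{\b(\w(?:\s+\w\b)+)}{ (my $w = $1) =~ s/\s+//g; $w }ge;
    return $subject;
}

print unsplit("V *I *A *G *R *A"), "\n";       # VIAGRA
print unsplit("D E B T consolidation"), "\n";  # DEBT consolidation
```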

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      To help avoid inadvertently collapsing a *legitimate* single character, you could use the fact that the "funny" separator is the same between the pieces that should be collapsed (until the spammers catch on to that, anyway).
      It will be a not-fun day when they start doing random grouping of chars in their headers, i.e.:
      viagra = V *IA *GR  **A
      But, ignoring that situation: get the separator - " *", for instance - and remove all occurrences of it. Just a thought.
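      As a sketch, the "same separator" constraint maps naturally onto a backreference (the {1,3} cap on separator length is my own guess):

```perl
use strict;
use warnings;

# Capture the separator after the first letter and require exactly the
# same separator between every following letter, then strip it out.
sub strip_repeated_separator {
    my ($text) = @_;
    $text =~ s{\b(\w(?:(\W{1,3})\w)(?:\2\w)+)\b}
              { (my $w = $1) =~ s/\Q$2\E//g; $w }ge;
    return $text;
}

print strip_repeated_separator("Say no to D *E *B *T today"), "\n";
# Say no to DEBT today
```

      As noted, this still misses random grouping like "V *IA *GR  **A", since each piece between separators must be a single letter.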

        Sorry to be the bearer of bad tidings, but I saw some spam a couple of days ago that compared its product to "vi ag r a".

        Hugo
      That sounds pretty good. Thanks for the pointer.
Re: Spam filtering regexp - keyword countermeasure countermeasure
by pzbagel (Chaplain) on May 12, 2003 at 16:26 UTC

    First off, spam and spam filtering are an ever-escalating arms race. You should be so lucky as to stay one step ahead of them. Are you able to use something like SpamBayes? Bayesian filtering is quickly becoming the best way to deal with spam.

    If you cannot run these tools, or just plain insist on writing this script, perhaps a good tactic would be to remove all punctuation and spaces from the subject line and then check it against a list of spam-ish words (debt, enlarge, coed). However, this idea will not handle ordinary acronyms. Another tactic might be to take that same list of spam-ish words and insert a check for non-alpha characters between each letter:

    /d[^A-Za-z]*e[^A-Za-z]*b[^A-Za-z]*t/i

    But that's woefully inefficient.
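    Building the pattern once from the word list and precompiling it with qr// helps a little with that, e.g. (the word list is just an example):

```perl
use strict;
use warnings;

# Join each spammy word's letters with an "anything but letters" gap,
# then compile the alternation once.
my @spammy  = qw(debt enlarge coed);
my $gap     = '[^A-Za-z]*';
my $alt     = join '|', map { join $gap, split //, $_ } @spammy;
my $spam_re = qr/$alt/i;

for my $subject ('Reduce your D-E-B-T now', 'Weekly status report') {
    print "$subject: ", $subject =~ $spam_re ? "flagged" : "ok", "\n";
}
# Reduce your D-E-B-T now: flagged
# Weekly status report: ok
```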

    Just my 0.02

Re: Spam filtering regexp - keyword countermeasure countermeasure
by tachyon (Chancellor) on May 13, 2003 at 06:42 UTC

    This discussion of Bayesian spam filtering should be of use: http://www.paulgraham.com/spam.html

    We run a Bayesian web filter that uses phrases. A phrase is a 1-, 2- or 3-word token. The tokens are contiguous character sets (generally words), but also URL domain links and some other bits and bobs. We use phrases rather than single words as this adds context.

    Everything is automated; we don't add anything by hand. The system is designed to work hands-free. The data sets used to generate these were large (i.e. 50,000), and the DB (with all single instances removed) covers about half a million such token phrases. As you get more data, you simply re-run the phrase/probability generator, which will then take account of new phrases that are commonly appearing in your target content.

    We use Math::BigFloat because if you use too many phrases, your probabilities blow off the ends of the floating-point accuracy. If you don't use Math::BigFloat, the optimal range of tokens to run the Bayesian calculation on is (in our testing) 8-20, depending on the data set. We pick the phrases that offer the greatest differential (i.e. good-bad probability difference). Accuracy can be very impressive: we run a sensitivity of > 99.7% and a specificity of 99.9% for porn, for example.

    The probabilities tend to move rapidly towards either 0 or 1. A probability level > 0.9 works well in practice, but even > 0.5 is still remarkably accurate.
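    For what it's worth, here is one way such 1-3 word phrase tokens could be cut - tachyon's actual tokenizer surely differs, so treat the names and details as illustration only:

```perl
use strict;
use warnings;

# Emit every 1-, 2- and 3-word window over the text as a phrase token.
sub phrase_tokens {
    my ($text) = @_;
    my @words = split ' ', lc $text;
    my @tokens;
    for my $n (1 .. 3) {
        push @tokens, join ' ', @words[ $_ .. $_ + $n - 1 ]
            for 0 .. @words - $n;
    }
    return @tokens;
}

print join("; ", phrase_tokens("debt free offer")), "\n";
# debt; free; offer; debt free; free offer; debt free offer
```

    Working in summed log-probabilities is another common way around the floating-point underflow that Math::BigFloat addresses here.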

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Hmm, one reply says to use individual chars, and you use groups of words. I can see how that would work, in that D,E,B and E,B,T are both 3-token groups that will be found.

      So I get the feeling that using Bayesian analysis on single whole words (e.g. POPFile) is the worst way to do it!

      My idea is to add more "context" than POPFile can glean by itself, by adding special keywords when the preliminary filter spots things.

        Combining tokens gives context. It allows you to differentiate to an extent between 'are you free tonight' and 'debt free' 'free widgets' etc.

        If you just run on 'free' as a single word, you lose sensitivity, as it is quite common. Using consecutive tokens is the way to go IMHO, and is the method employed in voice recognition (Dragon used to use 2 words for context and IBM 3, I believe).

        The price is paid in size and speed. To give you an idea, our single-word token file is ~100K, the two-word token file is ~10MB and the three-word token file is > 1GB. You will note a roughly two-orders-of-magnitude increase in size as you add length to the phrases.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Spam filtering regexp - keyword countermeasure countermeasure
by Abigail-II (Bishop) on May 12, 2003 at 20:06 UTC
    False positives are OK.

    In that case:

    sub is_spam { my $mail = shift; return 1; }

    Abigail