I am also using POPFile and I didn't notice that it failed to catch spam with "split" words. Most of the time it seems that there are sufficient other clues to categorize the e-mail as spam or not.
If it ain't broken, don't repair it!
However, as it is an interesting question, I would go along the following track:
- First, delete all "funny" characters (*, #, _, ...).
- Then collapse all whitespace between single alphabetic characters. This might catch the occasional 'I' or 'a', but I think that would be quite rare. If it happens too often, you could make an 'I' or 'a' at the beginning or end of a sequence of 'split' letters an exception to the rule. Alas, 'V *I *A *G *R *A' would then become 'VIAGR' and 'A', and 'I *N *T *E *R *E *S *T *R *A *T *E *S D *R *O *P' would make 'I' and 'NTEREST RATES DROP', but POPFile would quickly catch on to these as well.
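The two steps above could be sketched roughly as follows. This is only an illustration, not anything POPFile does internally: the function name `normalize_split_words` and the exact set of "funny" characters are assumptions (the post only names *, # and _), and the optional exception for a leading or trailing 'I' or 'a' is not implemented.

```python
import re

# Assumed set of "funny" filler characters; only *, # and _ come from the post.
FUNNY_CHARS = re.compile(r"[*#_~^|]")

# A lone letter, followed by whitespace, followed by another lone letter.
# The lookahead leaves the second letter unconsumed so it can start the next match.
SPLIT_GAP = re.compile(r"\b([A-Za-z])\s+(?=[A-Za-z]\b)")


def normalize_split_words(text):
    """Hypothetical helper: undo simple letter-splitting tricks in a message."""
    # Step 1: delete the filler characters.
    stripped = FUNNY_CHARS.sub("", text)
    # Step 2: collapse whitespace between single alphabetic characters.
    return SPLIT_GAP.sub(r"\1", stripped)


if __name__ == "__main__":
    print(normalize_split_words("V *I *A *G *R *A"))
    print(normalize_split_words("I am also using POPFile"))
```

Ordinary text passes through unchanged as long as a single-letter word is not directly next to another single-letter word. Note that when split words are separated from each other by only a single space, this sketch also merges them into one long token.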
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law