I'm contemplating writing a pre-filter to use on incoming email before it's processed by POPFile. POPFile and tools like it operate on words, so the email must be parsed into word tokens first.
Now you and I know that a Subject containing "D *E *B *T" is spam, but this clue is totally invisible to POPFile.
My thinking is that if a Perl program can detect things like this and put the reconstructed word into a header, POPFile will see that word, and the presence of the header is itself a clue that obfuscation was detected.
I've long seen words separated out with some punctuation mark, and that's easy to write a simple regexp for. But this is the first time I've seen spaces and punctuation used at the same time.
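For reference, the "easy" case mentioned above (single letters joined by one punctuation mark) might look something like the sketch below. It is written in Python only to keep the example short; the regex syntax carries over to Perl essentially unchanged. The separator set, the three-letter minimum, and the function name `find_punct_words` are assumptions for illustration, not anything POPFile defines.

```python
import re

# Single letters joined by exactly one punctuation mark, e.g. "D.E.B.T".
# Assumptions: the separator set below, and a minimum of three letters
# so that short pairs such as "a.m" are not reported.
PUNCT_SEPARATED = re.compile(
    r"(?<![A-Za-z])"               # not preceded by a letter
    r"[A-Za-z]"                    # first letter
    r"(?:[.\-_*/][A-Za-z]){2,}"    # (separator + letter), two or more times
    r"(?![A-Za-z])"                # not followed by a letter
)

def find_punct_words(text):
    """Return the collapsed words hidden behind single-punctuation separators."""
    return [re.sub(r"[^A-Za-z]", "", m.group(0))
            for m in PUNCT_SEPARATED.finditer(text)]
```

With this pattern, `find_punct_words("Get out of D.E.B.T now")` yields `["DEBT"]`, but `"D *E *B *T"` yields nothing, because the separator is a space plus an asterisk rather than a single character. That is exactly the gap described above.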
Rather than staying one step ahead with constant updating, I'm hoping to figure out a general enough algorithm (probably not a simple regexp anymore) that can handle anything like that. False positives are OK. It should find groups of letters that are related as a "word" even though they are not contiguous in a file. The algorithm would handle ordinary acronyms (F.L.A.R.P.) as a special case.
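One possible starting point, offered as a rough sketch rather than a finished answer: treat any short run of non-alphanumeric characters (spaces, punctuation, or a mix) as a separator between single letters, collapse the match, and emit it in a header. Again this is Python for brevity, and the regex ports to Perl as-is. The header name `X-Deobfuscated`, the separator length limit of three, and the function names are all made up for illustration. It will happily produce false positives such as "a b c", which the requirements above allow.

```python
import re

# Single letters separated by 1-3 non-alphanumeric characters of any kind.
# Assumptions: at least three letters per "word", separators never span a
# newline, and a run of more than three separator characters ends the word.
SPACED_WORD = re.compile(
    r"(?<![A-Za-z0-9])"                     # not glued to a preceding word
    r"[A-Za-z]"                             # first letter
    r"(?:[^A-Za-z0-9\n]{1,3}[A-Za-z]){2,}"  # (separator run + letter), 2+ times
    r"(?![A-Za-z0-9])"                      # last letter must stand alone
)

def find_spaced_words(text):
    """Return collapsed words built from separated single letters."""
    return [re.sub(r"[^A-Za-z]", "", m.group(0))
            for m in SPACED_WORD.finditer(text)]

def header_for(subject):
    """Build a hypothetical X-Deobfuscated header line, or None if nothing was found."""
    words = find_spaced_words(subject)
    if not words:
        return None
    return "X-Deobfuscated: " + " ".join(words)
```

Here `header_for("Get out of D *E *B *T today")` gives `"X-Deobfuscated: DEBT"`, and ordinary acronyms like `F.L.A.R.P.` fall out as a special case of the same pattern. It does not handle letters replaced by digits or look-alike characters; that would need a separate normalization step.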
Any ideas?
—John