in reply to Merge/Purge address data
i.e. if 80% of a predetermined set of "smart essays" contained the word "oxymoron" or "phlegm", then if a new essay came along with the word "phlegm", there's a high chance it would be smart and a very low chance it wouldn't be.
So, if you have a "high usage" of smart words, you'd have a "smart essay". SpamAssassin (w/ Bayes) works the same way. Any time it finds spam, classified by you or by the system (which figured out it is spam by other methods), it analyzes it and says, "This mail has these words in it. Adjust how common these words are against my prior knowledge, and increase the relative percentages that say, if these words appear, it's prolly a spam message." At the opposite end of things, if a message isn't spam, it lowers the spam percentages of the words that appear in it.
That's why Bayes filtering requires training. If you get a lot of mail about Visual Basic that you didn't solicit, and you've marked it as spam, those words would increase the chance that a message gets classified as spam. But if I were a Visual Basic programmer, that'd be stupid, since I'd expect a lot of mail about Visual Basic and may be interested in it. So "basic" might have a high percentage for you and a low one for me. Thus, the training process.
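To make the training idea above concrete, here is a minimal sketch in Python. It is a toy model, not how SpamAssassin actually stores or combines its numbers: the `WordStats` class, the add-one smoothing, the equal-priors assumption, and the sample messages are all made up for illustration.

```python
from collections import Counter


class WordStats:
    """Per-user word counts for spam and non-spam mail (toy model)."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        # Count each word once per message, whichever pile it lands in.
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.spam_msgs += 1
        else:
            self.ham_words.update(words)
            self.ham_msgs += 1

    def spam_probability(self, word):
        """P(spam | word), with add-one smoothing and equal priors assumed."""
        word = word.lower()
        p_word_spam = (self.spam_words[word] + 1) / (self.spam_msgs + 2)
        p_word_ham = (self.ham_words[word] + 1) / (self.ham_msgs + 2)
        return p_word_spam / (p_word_spam + p_word_ham)


# "You" never asked for Visual Basic mail and marked it as spam.
you = WordStats()
you.train("learn visual basic fast", is_spam=True)
you.train("visual basic course discount", is_spam=True)
you.train("lunch on friday", is_spam=False)

# "Me", the Visual Basic programmer, wants that mail.
me = WordStats()
me.train("visual basic compiler bug", is_spam=False)
me.train("basic question about forms", is_spam=False)
me.train("cheap pills discount", is_spam=True)

print(you.spam_probability("basic"), me.spam_probability("basic"))
```

The same word ends up above 0.5 for one user and below it for the other, purely because of what each of them fed the filter. Without those hand-classified examples there is nothing to compute.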
So you see, classifying how much one thing has in common with another requires samples as examples. Taking variations of people's names and classifying them as "yeah, this is bob smith" or "no, this isn't" would require prior analysis of examples of the various combinations. For instance, I sign my name in one of 3 ways. If a 4th variation came up, my significant other would say, "I've seen how he commonly does it... this 4th way doesn't have the qualities of the other 3, so it most likely isn't his."
So unless you feel like sitting down and building the variations by hand, then verifying them later to train your filter for the various addresses (yes, Dr. Alan, Dr Alan Belfour, and Dr Alan Belfour MD are the same person), and generating a small database per person on who is and who isn't said person, don't do Bayes. It doesn't work on arbitrary data w/o training on that type of data. Your types change, since each person is a type w/ a specific address. Plus, you'd create a database for each "is this vs. is this not" comparison. For a small number of people, it'd be great.
You might get away w/ some other homebrew, e.g. stripping all abbreviations, adding the common ways of representing the various parts of the name to a sorted array, and seeing which array a new entry has the most in common with. Using Soundex to compare the parts would work. It's not Bayes, since you haven't trained any statistical model. All in all, this can be a very expensive op, since humans like to type freeform, and freeform typing is always hard to analyze: "William gates", "bill gates", "gates, bill", "mr bill gates ceo", ...
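A rough sketch of that homebrew in Python, under loud assumptions: the `TITLES` and `NICKNAMES` tables are tiny hypothetical stand-ins for lists you would have to build yourself, `soundex` is a hand-rolled version of American Soundex, and `overlap` is just one possible way to score "has the most in common with".

```python
import re

TITLES = {"mr", "mrs", "ms", "dr", "md", "ceo"}    # hypothetical, far from complete
NICKNAMES = {"bill": "william", "bob": "robert"}   # hypothetical, far from complete

# Soundex digit for each consonant group.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not break a run of equal codes
        code = CODES.get(ch, "")     # vowels map to "" and do break a run
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]


def name_key(raw):
    """Strip titles, expand nicknames, and return the sorted Soundex codes."""
    words = re.sub(r"[^a-z\s]", " ", raw.lower()).split()
    parts = [NICKNAMES.get(w, w) for w in words if w not in TITLES]
    return tuple(sorted(soundex(p) for p in parts))


def overlap(key_a, key_b):
    """Share of codes the two keys have in common (0.0 to 1.0)."""
    a, b = set(key_a), set(key_b)
    return len(a & b) / len(a | b) if a | b else 0.0


variants = ["William gates", "bill gates", "gates, bill", "mr bill gates ceo"]
keys = {name_key(v) for v in variants}
print(keys)
```

All four spellings collapse to a single key here, but only because "bill" and the titles happen to be in the lookup tables. Every abbreviation or nickname missing from those tables is a missed match, and comparing each record against every other one is where the expense comes from.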
Replies are listed 'Best First'.
Re: Re: Merge/Purge address data
by cleverett (Friar) on Nov 12, 2003 at 02:44 UTC
by pernod (Chaplain) on Nov 12, 2003 at 08:34 UTC
by cleverett (Friar) on Nov 12, 2003 at 09:55 UTC