I offer no concrete solution, but I will explain why you can't do Bayes. Bayesian filtering (built on Bayes' theorem) is a classification method. You classify a subset as having some quality and then ask the question, "does this unclassified text belong in that subset?" It answers that with statistics on how often words are reused across the subset and the chance they will be seen again.

I.e., if 80% of a predetermined set of "smart essays" contained the word "oxymoron" or "phlegm", then if a new essay came along with the word "phlegm", there's a high chance it would be smart and a low chance it wouldn't be (assuming the word is rare in essays that aren't smart).
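To put rough numbers on that, here is a minimal Python sketch of Bayes' rule for a single word. Only the 80% figure comes from the example above; the 5% rate for other essays and the 50/50 prior are made-up assumptions, and `posterior` is just an illustrative name.

```python
def posterior(p_word_given_smart, p_word_given_other, prior_smart):
    """P(smart | word seen), by Bayes' rule for a single word."""
    evidence = (p_word_given_smart * prior_smart
                + p_word_given_other * (1.0 - prior_smart))
    return p_word_given_smart * prior_smart / evidence


# 0.80 is the figure from the example; 0.05 and the 0.5 prior are assumptions.
rare_elsewhere = posterior(0.80, 0.05, 0.5)

# Same word, but now it shows up just as often in the other essays.
common_elsewhere = posterior(0.80, 0.80, 0.5)

print(round(rare_elsewhere, 3))    # 0.941
print(round(common_elsewhere, 3))  # 0.5
```

The second number is the catch: a word that is just as common outside the subset tells you nothing, which is why the filter needs examples of both kinds.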

So, if you have a "high usage" of smart words, you'd have a "smart essay". SpamAssassin (w/ Bayes) works the same way. Anytime it finds spam, classified by you or the system (figuring it is spam by other methods), it analyzes it and says, "this mail has these words in it. Adjust how common these words are against my prior knowledge, and increase the chance that if these words appear, it's prolly a spam message." On the opposite end of things, if a message isn't spam, it lowers the spam percentages of the words appearing in it.

That's why Bayes filtering requires training. If you get a lot of mail about Visual Basic that you didn't solicit and you've marked it as spam, those words would increase the chance that a message is spam. But if I were a Visual Basic programmer, that'd be stupid, since I'd expect a lot of mail about Visual Basic and may be interested. So "basic" might have a high percentage for you and a low one for me. Thus, the training process.
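A toy version of that training process, in Python. This is a simplified naive Bayes sketch and not how SpamAssassin actually implements it: `TinyBayes` is a hypothetical class, it only scores the words present in a message, it uses add-one smoothing, and the training messages are invented.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy per-user word filter: counts messages containing each word."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.docs[label] += 1
        self.words[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            # Prior for the label, add-one smoothed.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for word in set(text.lower().split()):
                # Share of this label's messages containing the word, smoothed.
                share = (self.words[label][word] + 1) / (self.docs[label] + 2)
                score += math.log(share)
            log_score[label] = score
        # Turn the two log scores back into a probability.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)


# "you" marked unsolicited Visual Basic mail as spam.
you = TinyBayes()
you.train("learn visual basic fast", "spam")
you.train("visual basic training offer", "spam")
you.train("lunch on friday", "ham")

# "me" is a Visual Basic programmer who wants that mail.
me = TinyBayes()
me.train("visual basic compiler bug", "ham")
me.train("visual basic code review", "ham")
me.train("cheap pills offer", "spam")

message = "new visual basic release"
print(you.spam_probability(message) > 0.5)  # True
print(me.spam_probability(message) > 0.5)   # False
```

Same message, opposite verdicts, and the only difference is what each filter was trained on.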

So you see, classifying how similar one thing is to another requires a sample as an example. Taking variations of people's names and classifying them as "yeah, this is Bob Smith" or "no, this isn't" would require prior analysis of examples of the various combinations. For instance, I sign my name in one of 3 ways. If a 4th variation came up, my signif other would say, "I've seen how he commonly does it.. this 4th way doesn't have the qualities of the other 3, so most likely it isn't his."

So unless you feel like sitting and building the variations by hand, and then verifying them later to train your filter for various addresses (yes, Dr. Alan, Dr Alan Belfour, and Dr Alan Belfour MD are the same person), and generating a small database per person on who is and who isn't said person, don't do Bayes. It doesn't work on arbitrary data w/o training on that type of data. Your types keep changing, since each person is a type w/ a specific address. Plus, you'd create a database for each "is this vs. is this not" comparison. For small n people, it'd be great.

You might get away w/ some other homebrew, i.e. stripping all abbreviations, adding common ways of representing the various parts of the name to a sorted array, and seeing which array a new name has the most in common with. Using Soundex to compare the parts would work. It's not Bayes, since you haven't trained any statistical analysis. All in all, this can be a very expensive op, since humans like to type freeform, and freeform typing is always hard to analyze: "William gates", "bill gates", "gates, bill", "mr bill gates ceo", ....
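A rough Python sketch of that kind of homebrew, just to show the shape of it. The `TITLES` and `NICKNAMES` tables are tiny made-up placeholders you'd have to build out yourself, `name_key` is a hypothetical helper, and `soundex` is a hand-rolled version of the usual American Soundex rules.

```python
import re

# Hypothetical lookup tables; real data would need far bigger ones.
TITLES = {"mr", "mrs", "ms", "dr", "md", "ceo", "jr", "sr"}
NICKNAMES = {"bill": "william", "bob": "robert"}

# Soundex digit for each coded consonant.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        # h and w do not separate two letters that share a code; vowels do.
        if ch not in "hw":
            previous = code
    return (result + "000")[:4]


def name_key(raw):
    """Drop titles, expand nicknames, and return the sorted Soundex codes."""
    words = re.findall(r"[a-z]+", raw.lower())
    parts = [NICKNAMES.get(w, w) for w in words if w not in TITLES]
    return tuple(sorted(soundex(p) for p in parts))


variants = ["William gates", "bill gates", "gates, bill", "mr bill gates ceo"]
keys = {name_key(v) for v in variants}
print(len(keys))               # 1: all four collapse to the same key
print(name_key("steve jobs"))  # a different key
```

Sorting the codes makes word order irrelevant, which also means "Gates William" and "William Gates" collide, and Soundex itself lumps together names that merely sound alike. Treat a matching key as a candidate for a human to check, not as proof of the same person.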


Play that funky music white boy..

In reply to Re: Merge/Purge address data by exussum0
in thread Merge/Purge address data by cleverett
