Re: Merge/Purge address data

I offer no concrete solution, but I will explain why you can't do Bayes. Bayes' theorem, as these filters use it, is a classification method. You classify a subset as having some quality and then ask the question, "does this unclassified text belong in that subset?" It answers that with statistics on how often words recur across the subset and the probability that they will be seen again.

I.e., if 80% of a predetermined set of "smart essays" contained the word "oxymoron" or "phlegm" (and few other essays did), then if a new essay came along with the word "phlegm", there's a high chance it would be smart and a very low chance it wouldn't be.

So, if you have a "high usage" of smart words, you'd have a "smart essay". SpamAssassin (w/ Bayes) works the same way. Anytime it finds spam, classified by you or by the system (which figured out it is spam by other methods), it analyzes it and says, "this mail has these words in it. Adjust how common these words are against my prior knowledge, and increase the probability that a message containing them is spam." In the opposite direction, if a message isn't spam, it lowers the spam probability of the words that appear in it.
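
The word-counting idea above can be sketched roughly as follows. This is a toy two-class filter, not SpamAssassin's actual algorithm: the class name `WordBayes`, the add-one smoothing, and the choice to score only the words present in a message are all assumptions made for illustration.

```python
import math
from collections import Counter


class WordBayes:
    """Toy two-class word filter (hypothetical; not SpamAssassin's code)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per training message.
        self.docs[label] += 1
        self.words[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        # Start from the prior odds, then add the log-odds of every word
        # in the message. Add-one smoothing keeps unseen words from
        # producing a zero probability.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in set(text.lower().split()):
            p_spam = (self.words["spam"][word] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = WordBayes()
bayes.train("cheap pills now", "spam")
bayes.train("cheap watches now", "spam")
bayes.train("meeting notes attached", "ham")
bayes.train("lunch meeting tomorrow", "ham")
likely_spam = bayes.spam_probability("cheap pills")
likely_ham = bayes.spam_probability("meeting tomorrow")
```

With no training calls at all, every message scores 0.5, which is the point: the filter knows nothing until someone hands it labelled examples.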

It's why Bayes filtering requires training. If you get a lot of mail about Visual Basic that you didn't solicit, and you've marked it as spam, those words would increase the chance that a message is spam. But if I were a Visual Basic programmer, that'd be stupid, since I'd expect a lot of mail about Visual Basic and may be interested. So "basic" might have a high percentage for you and a low one for me. Thus, the training process.

So you see, classifying how common one thing is to another requires a sample as an example. Having variations of people's names and classifying them as "yeah, this is bob smith" or "no, this isn't" would require prior analysis of examples of the various combinations. For instance, I sign my name in one of 3 ways. If a 4th variation came up, my significant other would say, "I've seen how he commonly does it... this 4th way doesn't have the qualities of the other 3, so it most likely isn't his."

So unless you feel like sitting and building the various variations by hand, and then verifying them later to train your filter for the various addresses (yes, Dr. Alan, Dr Alan Belfour, and Dr Alan Belfour MD are the same person), and generating a small database per person on who is and who isn't said person, don't do Bayes. It doesn't work on arbitrary data w/o training on that type of data. Your types change, as each person is a type w/ a specific address. Plus, you'd create a database for each "is this vs. is this not" comparison. For small n people, it'd be great.

You might get away w/ some other homebrew, i.e. stripping all abbreviations, adding the common ways of representing the various parts of the name to a sorted array, and seeing which array a new entry has the most in common with. Using Soundex to compare the parts would work. It's not Bayes, since you haven't trained it with any statistical analysis. All in all, this can be a very expensive op, since humans like to type freeform, and freeform typing is always hard to analyze. "William gates" "bill gates" "gates, bill" "mr bill gates ceo" ....
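
A minimal sketch of that homebrew, assuming a hand-maintained stop list of titles and a nickname table (`TITLES`, `NICKNAMES`, and `name_key` are hypothetical names, and both tables are made up for the example). The `soundex` function follows the standard American Soundex rules; Python's standard library has no Soundex, so it is written out here.

```python
import re

# Standard Soundex digit groups.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

TITLES = {"mr", "mrs", "ms", "dr", "md", "ceo"}    # hypothetical stop list
NICKNAMES = {"bill": "william", "bob": "robert"}   # hypothetical alias table


def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


def name_key(raw):
    # Lowercase, drop punctuation and titles, expand nicknames, then
    # sort the Soundex codes so word order no longer matters.
    tokens = re.findall(r"[a-z]+", raw.lower())
    tokens = [NICKNAMES.get(t, t) for t in tokens if t not in TITLES]
    return tuple(sorted(soundex(t) for t in tokens))


variants = ["William gates", "bill gates", "gates, bill", "mr bill gates ceo"]
keys = {name_key(v) for v in variants}
```

All four variants collapse to one key here only because "bill" and the titles happen to be in the tables. Anything the tables miss, or a record that omits a name part, still yields a different key, which is where the expense comes in.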

Play that funky music white boy..

Re: Re: Merge/Purge address data by cleverett (Friar) on Nov 12, 2003 at 02:44 UTC
Good explanation of Bayes. Certainly clarified things for me ... I was actually thinking of doing Bayes on different ways that the data could get mangled.

I think I have an idea of what might work:

Compute the "distance" from one string to another in terms of 3 different operations: inserting a character, deleting a character, and swapping substrings. Having it run from 0 being as far apart as possible to 1 being identical strings would be ideal.

Manually figure out a point scoring system so that for each column in the two records I add and subtract points from an overall score based on a function of the distance separating the two strings.

Call the records duplicates when the overall score goes over a predefined threshold.

So superficially at least it looks like SpamAssassin ...
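
One way the scoring scheme described above might look, as a sketch only. The column names, `WEIGHTS`, and `THRESHOLD` are invented for illustration and would need tuning by hand on real data. Python's `difflib.SequenceMatcher.ratio()` stands in for the custom distance function because it already runs from 0 (nothing in common) to 1 (identical); it is not the three-operation distance proposed in the post.

```python
from difflib import SequenceMatcher

# Hypothetical per-column weights and threshold; tune by hand on real data.
WEIGHTS = {"name": 3.0, "street": 2.0, "city": 1.0}
THRESHOLD = 3.0


def similarity(a, b):
    # 0.0 means nothing in common, 1.0 means identical strings.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a, rec_b):
    score = 0.0
    for column, weight in WEIGHTS.items():
        sim = similarity(rec_a[column], rec_b[column])
        # Map similarity 0..1 onto -weight..+weight, so a close column
        # adds points and a distant column subtracts them.
        score += weight * (2 * sim - 1)
    return score


def is_duplicate(rec_a, rec_b):
    return match_score(rec_a, rec_b) > THRESHOLD


a = {"name": "Bill Gates", "street": "1 Microsoft Way", "city": "Redmond"}
b = {"name": "Bil Gates", "street": "1 Microsoft Way", "city": "Redmond"}
c = {"name": "Alan Belfour", "street": "42 Elm Street", "city": "Springfield"}
```

Here `is_duplicate(a, b)` is true and `is_duplicate(a, c)` is false. Comparing every record against every other one is quadratic in the number of records, so on a large list some cheap grouping step (by postcode, say) would probably have to come first.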
Re x3: Merge/Purge address data by pernod (Chaplain) on Nov 12, 2003 at 08:34 UTC
> Compute the "distance" from one string to another in terms of 3 different operations. inserting a character deleting a character swapping substrings

Are you talking about Levenshtein distance? From the documentation of Text::Levenshtein:

"The Levenshtein edit distance is a measure of the degree of proximity between two strings. This distance is the number of substitutions, deletions or insertions ("edits") needed to transform one string into the other one (and vice versa). When two strings have distance 0, they are the same."

There is also a Text::LevenshteinXS. I have not used either of these modules myself, though, so I can't say anything more about their qualities. Perhaps this might save you some work. Good luck!

pernod -- Mischief. Mayhem. Soap.
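
For reference, the edit distance in that definition can be computed with the usual dynamic-programming table. The sketch below is in Python rather than Perl and does not reproduce the API of either module; `normalized_similarity` is a hypothetical helper that rescales the distance to the 0-to-1 range asked for earlier in the thread.

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    # 1.0 for identical strings, 0.0 when every position must change.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Note that plain Levenshtein counts single-character substitutions, not the substring swaps from the original idea, so "gates, bill" and "bill gates" still come out far apart unless the name parts are reordered first.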
Re: Re x3: Merge/Purge address data by cleverett (Friar) on Nov 12, 2003 at 09:55 UTC
Excellent! Many thanks.