in reply to Re: Merge/Purge address data
in thread Merge/Purge address data

Good explanation of Bayes. Certainly clarified things for me ...

I was actually thinking of doing Bayes on different ways that the data could get mangled.

I think I have an idea of what might work:

  1. Compute the "distance" from one string to another in terms of 3 different operations.
    • inserting a character
    • deleting a character
    • swapping substrings
    Ideally the measure would be normalized so that 0 means the strings are as far apart as possible and 1 means they are identical.
  2. Manually devise a point-scoring system: for each column in the two records, add or subtract points from an overall score as a function of the distance between the two strings.
  3. Call the records duplicates when the overall score exceeds a predefined threshold.
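A minimal sketch of steps 1-3 (in Python for illustration; the column names, weights, threshold, and the placeholder similarity function are all made up here, standing in for a real normalized edit distance):

```python
def similarity(a, b):
    # Placeholder 0..1 similarity (1.0 = identical strings).
    # A real implementation would use a normalized edit distance;
    # this one just counts positional character matches.
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

# Hypothetical per-column weights: how many points a perfect
# match on that column is worth.
WEIGHTS = {"surname": 4, "street": 3, "city": 2, "zip": 4}
THRESHOLD = 8  # hypothetical cut-off for calling records duplicates

def score(rec_a, rec_b):
    total = 0.0
    for col, weight in WEIGHTS.items():
        s = similarity(rec_a.get(col, ""), rec_b.get(col, ""))
        # Map similarity 0..1 onto -weight..+weight so that bad
        # matches actively subtract points, as in step 2.
        total += weight * (2 * s - 1)
    return total

def is_duplicate(rec_a, rec_b):
    # Step 3: duplicates iff the overall score clears the threshold.
    return score(rec_a, rec_b) >= THRESHOLD
```

Tuning the weights and threshold by hand against known duplicates is where the SpamAssassin resemblance comes in.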

So superficially at least it looks like SpamAssassin ...

Re x3: Merge/Purge address data
by pernod (Chaplain) on Nov 12, 2003 at 08:34 UTC
    1. Compute the "distance" from one string to another in terms of 3 different operations.
      • inserting a character
      • deleting a character
      • swapping substrings

    Are you talking about Levenshtein distance? From the documentation of Text::Levenshtein:

    The Levenshtein edit distance is a measure of the degree of proximity between two strings. This distance is the number of substitutions, deletions or insertions ("edits") needed to transform one string into the other one (and vice versa). When two strings have distance 0, they are the same.

    There is also a Text::LevenshteinXS. I have not used either of these modules myself, though, so I can't say anything more about their qualities.

    Perhaps this might save you some work. Good luck!
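The edit distance those modules compute can be sketched in a few lines (a Python illustration of the classic dynamic-programming algorithm, not the modules' actual code), along with the 0-to-1 normalization the parent node asked for:

```python
def levenshtein(a, b):
    # Classic DP edit distance: the minimum number of insertions,
    # deletions, or substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Normalize to the desired scale: 1.0 for identical strings,
    # values near 0 for maximally different ones.
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Note this only covers single-character edits; swapping whole substrings, as in the parent node's third operation, is not something plain Levenshtein distance accounts for.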

    pernod
    --
    Mischief. Mayhem. Soap.

      Excellent! Many thanks.