in reply to Fuzzy searching
Different algorithms may work better depending on exactly which piece(s) of data you want to de-duplicate on. If you're de-duplicating on multiple fields, some sort of hybrid de-duplication algorithm may be best. Here's an example de-duplication scheme I cooked up for a database of people, where they live, and what their income is:
(LAST_NAME the same according to String::Approx) and (INCOME within 5%) and (STATE the same)

In my experience, coming up with a de-duplication scheme for user-entered data is easy. Coming up with a good one is hard and may take weeks or months of tuning.
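
To make the three-part rule above concrete, here is a minimal sketch in Python. It is not the code I actually ran: `difflib.SequenceMatcher` stands in for String::Approx, the 0.85 similarity threshold is an arbitrary assumption, "within 5%" is assumed to mean 5% of the larger income, and the function names and dictionary-based record layout are hypothetical.

```python
from difflib import SequenceMatcher


def names_similar(a, b, threshold=0.85):
    """Approximate last-name match (stand-in for String::Approx).

    The 0.85 threshold is an assumption and is exactly the kind of
    knob that needs tuning against real data.
    """
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


def incomes_close(a, b, tolerance=0.05):
    """True when the incomes differ by at most 5% of the larger one (assumed meaning)."""
    larger = max(abs(a), abs(b))
    if larger == 0:
        return True  # both incomes are zero
    return abs(a - b) <= tolerance * larger


def is_duplicate(rec1, rec2):
    """Hybrid rule: fuzzy LAST_NAME and INCOME within 5% and same STATE."""
    return (
        names_similar(rec1["LAST_NAME"], rec2["LAST_NAME"])
        and incomes_close(rec1["INCOME"], rec2["INCOME"])
        and rec1["STATE"].strip().upper() == rec2["STATE"].strip().upper()
    )


if __name__ == "__main__":
    a = {"LAST_NAME": "Johnson", "INCOME": 50000, "STATE": "OH"}
    b = {"LAST_NAME": "Jonson", "INCOME": 51500, "STATE": "oh"}
    print(is_duplicate(a, b))
```

All three conditions must hold, so a close name match alone never merges two records; loosening or tightening the name threshold and the income tolerance is where most of the tuning time goes.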
RE: Re: Fuzzy searching
by Anonymous Monk on May 18, 2000 at 23:08 UTC |