in reply to Re^2: Fuzzy text matching... again
in thread Fuzzy text matching... again
And while we're at it: how would a machine identify what is a name and what not
I can't remember mentioning names anywhere in my response. Name matching has a completely different kind of complexity (Mark and Mary only have an edit distance of 1; Richard, Rich, Dick, Dickie may all refer to the same person; cultural, ethnic and religious variations to a name; gender; variations in converting from native to latin-1; etc). And yes, there are huge databases of names (common and otherwise) to deal with this. Check out Global Name Recognition software owned by IBM for a very expensive solution to that problem.
Of course, names are not the only specific kinds of strings that have their own complexity. Mailing addresses, dates, identifying an anonymous author, identifying plagerism, indexing, etc. I limited my advice to the problem as I understood it. I obviously could have missed the mark thinking these were publication citations making the "what is a name" germane but I don't see why it can't just be treated like any other token. I obviously missed the trailing parens sometimes being important as a translation but I assume the OP will be capable of constructing an approach for removing unimportant tokens.
Cheers - L~R
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^4: Fuzzy text matching... again
by almut (Canon) on Jan 07, 2010 at 18:03 UTC |