Re: Read a column; Compare strings for close (not exact) matches
by Codon (Friar) on Jul 07, 2005 at 18:49 UTC
|
You could try one of the /sounds like/ type of modules from CPAN. Text::Metaphone and Text::Soundex are two that I have used. Both have limitations. It really depends on what you are willing to consider a "close match". Do you want 'Don Banks' to be a close match for 'Dawn Binx'? You may get odd results.
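To make the 'Don Banks' versus 'Dawn Binx' point concrete, here is a minimal hand-rolled sketch of the classic Soundex coding. It is a simplified approximation written for illustration, not the code inside Text::Soundex, and the names simple_soundex and name_key are made up for this example.

```perl
use strict;
use warnings;

# Consonant groups of classic Soundex; vowels, Y, H and W get no digit.
my %CODE;
@CODE{qw(B F P V)}         = (1) x 4;
@CODE{qw(C G J K Q S X Z)} = (2) x 8;
@CODE{qw(D T)}             = (3) x 2;
$CODE{L}                   = 4;
@CODE{qw(M N)}             = (5) x 2;
$CODE{R}                   = 6;

# Hypothetical helper: first letter plus up to three digits, zero padded.
sub simple_soundex {
    my ($word) = @_;
    my $s = uc $word;
    $s =~ s/[^A-Z]//g;                 # assumption: ASCII letters only
    return '' unless length $s;

    my $first = substr($s, 0, 1);
    my $out   = $first;
    my $prev  = $CODE{$first} // '';
    for my $ch (split //, substr($s, 1)) {
        next if $ch eq 'H' || $ch eq 'W';   # H and W do not separate equal codes
        my $c = $CODE{$ch} // '';           # vowels reset the "previous" code
        $out .= $c if $c ne '' && $c ne $prev;
        $prev = $c;
    }
    return substr($out . '000', 0, 4);
}

# Hypothetical helper: encode each word of a full name separately.
sub name_key {
    my ($name) = @_;
    return join ' ', map { simple_soundex($_) } split ' ', $name;
}

print name_key('Don Banks'), "\n";   # D500 B520
print name_key('Dawn Binx'), "\n";   # D500 B520
```

Both names collapse to the same key, which is exactly the kind of "odd result" to decide on before adopting a phonetic code.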
Alternately, you will need to better define what constitutes a "close match". Is it something that a human would recognize, but is hard to define in programmatic ways? Off-by-one type matches you may be able to program, but it could be error-prone and expensive (time intensive to run).
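One common way to pin down "off-by-one" programmatically is edit distance (Levenshtein distance): the number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is only an illustration under that assumption; the name edit_distance is made up, and whether a distance of 1 or 2 counts as "close" is a choice you would have to make for your data.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance, keeping only the previous row of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Distance from the empty prefix of $s to each prefix of $t.
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,          # deletion
                $cur[$j - 1] + 1,       # insertion
                $prev[$j - 1] + $cost,  # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print edit_distance('Heffner', 'Hefner'), "\n";   # 1
print edit_distance('kitten', 'sitting'), "\n";   # 3
```

Each call costs time proportional to the product of the two string lengths, and comparing every name in one list against every name in another multiplies that by the product of the list sizes, which is where the expense comes from.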
Any additional info you can supply here would help us better gauge what you are trying to accomplish.
Ivan Heffner
Sr. Software Engineer, DAS Lead
WhitePages.com, Inc.
| [reply] |
|
|
Hi,
Thanks everyone,
I looked at the soundex module, but somewhere it said that it only considered English words and pronunciations... some of the names I have to work with are Indian, Chinese and even African :-(
The problem in more details:
I have a group of people calling two different helpdesks. They give their names to be identified. I need to find out which people call both helpdesks, and I can only use their names as identification: they do not give their IDs, and their departments change so often that comparing them would not be useful.
But the names can be misspelled by the helpdesk personnel; also, for example, a Robert Carlos could sometimes give his name as Bob Carlos :-( I don't know that I can do much about short forms of names, but I would like different spellings to be identified as the same person as much as possible.
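For what it's worth, the exact-match part of the problem, including short forms, might be sketched roughly like this. Everything here is an assumption for illustration: the nickname table is a tiny hand-made sample, the two lists stand in for the real helpdesk logs, and the names normalize_name and callers_on_both_desks are made up.

```perl
use strict;
use warnings;

# Hypothetical, hand-maintained table of short forms (illustrative only).
my %NICKNAME = (
    bob  => 'robert',
    rob  => 'robert',
    bill => 'william',
    liz  => 'elizabeth',
);

# Lowercase, squeeze whitespace, and expand a known short first name.
sub normalize_name {
    my ($name) = @_;
    my @parts = split ' ', lc $name;   # split ' ' also drops leading spaces
    return '' unless @parts;
    $parts[0] = $NICKNAME{ $parts[0] } // $parts[0];
    return join ' ', @parts;
}

# Return the normalized names that appear in both lists.
sub callers_on_both_desks {
    my ($desk_a, $desk_b) = @_;
    my %seen_a = map { normalize_name($_) => 1 } @$desk_a;
    my %both;
    for my $name (@$desk_b) {
        my $key = normalize_name($name);
        $both{$key} = 1 if $seen_a{$key};
    }
    return sort keys %both;
}

# Made-up sample data standing in for the two helpdesk logs.
my @desk_a = ('Robert Carlos', 'Mei  Chen');
my @desk_b = ('bob carlos', 'Anil Gupta', 'mei chen');
print "$_\n" for callers_on_both_desks(\@desk_a, \@desk_b);
```

This only catches case, spacing, and listed short forms; misspellings would still need a fuzzy comparison on top of the normalized keys.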
Thank you very much,
Eyi.
| [reply] |
|
|
Lingua::EN::MatchNames would be a start - I'm not sure how much support there is for Indian/Chinese/African names.
| [reply] |
Re: Read a column; Compare strings for close (not exact) matches
by Transient (Hermit) on Jul 07, 2005 at 18:43 UTC
|
| [reply] |
|
|
Thanks Transient. I'm looking at it, but I'm not sure it'd work for non-English names :-( But I'll try and see what results I get.
| [reply] |
Re: Read a column; Compare strings for close (not exact) matches
by ww (Archbishop) on Jul 07, 2005 at 18:45 UTC
|
| [reply] |
|
|
Thanks ww, this is certainly helpful. Checking it out...
| [reply] |
Re: Read a column; Compare strings for close (not exact) matches
by Limbic~Region (Chancellor) on Jul 07, 2005 at 19:07 UTC
|
eyidearie,
I already gave some advice over here, which I will repeat here for the benefit of everyone.
That is to look at a few modules on CPAN, and see if they fit your needs.
If you need more specific help after that, come back and augment your question with more details.
| [reply] |
Re: Read a column; Compare strings for close (not exact) matches
by friedo (Prior) on Jul 07, 2005 at 19:04 UTC
|
In addition to the Soundex type stuff, the module Spreadsheet::ParseExcel works great for getting data out of Excel sheets. | [reply] |
Re: Read a column; Compare strings for close (not exact) matches
by flogic (Acolyte) on Jul 07, 2005 at 20:38 UTC
|
Not something I'd recommend. It's what I did before I found out about soundex and metaphone. However it is a cute regex trick.
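sub offby {
    my ($dist, $word, @possibles) = @_;
    # Build a pattern of $dist+1 captured chunks separated by optional
    # single characters, then require the same chunks after the colon.
    # For $dist = 1 this yields: ^(\w*).?(\w*):\1.?\2$
    my $re = sprintf(
        q|^%s:%s$|,
        join(q|.?|, map { q|(\w*)| } (0 .. $dist)),
        join(q|.?|, map { "\\" . ($_ + 1) } (0 .. $dist))
    );
    my @suggestions;
    #print "user $word regexp $re\n";
    foreach (@possibles) {
        # Test "word:candidate" so the backreferences span both strings.
        my $str = "$word:$_";
        if ($str =~ /$re/) {
            push(@suggestions, $_);
        }
    }
    return @suggestions;
}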
| [reply] [d/l] |
Re: Read a column; Compare strings for close (not exact) matches
by TedPride (Priest) on Jul 07, 2005 at 19:52 UTC
|
Checking for one-letter differences isn't hard. Neither is checking for two letters that have been swapped by mistake. However, matching according to sound differences, especially in several different languages at once, is going to be next to impossible. What you need is a learning system - one that lists all unmatched last names for both desks in alphabetic order, and remembers which name pairs you pick. Once all last names have been matched up or marked as having no match, you can then sort by last names (each last name being followed by its aliases), and go to work on first names. After several months of use, the system should be able to figure out almost all the matches on its own.
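A rough sketch of those two cheap checks, plus the "remember what was picked" part, might look like the following. It assumes "one-letter difference" means one substituted letter in names of equal length, keeps the remembered pairs in an in-memory hash rather than a file or database, and uses made-up names (near_match, remember_pair, resolve).

```perl
use strict;
use warnings;

# True if the names differ by one substituted letter or by one swap of
# two adjacent letters. Identical names are not counted as a near match.
sub near_match {
    my ($x, $y) = map { lc } @_;
    return 0 if length $x != length $y;

    my @diff = grep { substr($x, $_, 1) ne substr($y, $_, 1) }
               0 .. length($x) - 1;
    return 1 if @diff == 1;                          # one letter differs
    return 1 if @diff == 2
             && $diff[1] == $diff[0] + 1             # adjacent positions
             && substr($x, $diff[0], 1) eq substr($y, $diff[1], 1)
             && substr($x, $diff[1], 1) eq substr($y, $diff[0], 1);
    return 0;
}

# Pairs a human has confirmed: variant spelling => chosen spelling.
my %alias;

sub remember_pair {
    my ($variant, $canonical) = @_;
    $alias{ lc $variant } = lc $canonical;
    return;
}

# Map a name to its remembered spelling, or to itself if unknown.
sub resolve {
    my ($name) = @_;
    return $alias{ lc $name } // lc $name;
}

print near_match('Carlos', 'Carlso') ? "near\n" : "far\n";   # near
remember_pair('Carlso', 'Carlos');
print resolve('CARLSO'), "\n";                               # carlos
```

In a real tool near_match would only propose candidate pairs for a person to confirm, and the confirmed pairs would be saved between runs.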
I assume you have a large sampling of names to work from? Otherwise you could do this by hand, and wouldn't need a script for it. | [reply] |