Read a column; Compare strings for close (not exact) matches

eyidearie has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Read a column; Compare strings for close (not exact) matches by Codon (Friar) on Jul 07, 2005 at 18:49 UTC
You could try one of the /sounds like/ type of modules from CPAN. Text::Metaphone and Text::Soundex are two that I have used. Both have limitations. It really depends on what you are willing to consider a "close match". Do you want 'Don Banks' to be a close match for 'Dawn Binx'? You may get odd results. Alternatively, you will need to better define what constitutes a "close match". Is it something that a human would recognize but that is hard to define programmatically? Off-by-one matches you may be able to program, but that could be error-prone and expensive (slow to run). Any additional info you can supply here would help us better gauge what you are trying to accomplish. Ivan Heffner Sr. Software Engineer, DAS Lead WhitePages.com, Inc.
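To make the 'Don Banks' versus 'Dawn Binx' question concrete, here is a minimal sketch of classic Soundex coding in plain Perl. It is a simplified illustration, not the code of Text::Soundex; the helper names `simple_soundex` and `name_key` are made up for this example, and it only looks at ASCII letters.

```perl
use strict;
use warnings;

# Letter-to-digit table for classic Soundex; vowels, H, W and Y have no code.
my %CODE;
@CODE{qw(B F P V)}         = (1) x 4;
@CODE{qw(C G J K Q S X Z)} = (2) x 8;
@CODE{qw(D T)}             = (3) x 2;
$CODE{L}                   = 4;
@CODE{qw(M N)}             = (5) x 2;
$CODE{R}                   = 6;

# simple_soundex: hypothetical helper, returns first letter plus three digits.
sub simple_soundex {
    my ($name) = @_;
    (my $letters = uc $name) =~ s/[^A-Z]//g;    # keep ASCII letters only
    return '' unless length $letters;

    my @chars = split //, $letters;
    my $first = shift @chars;
    my $out   = $first;
    my $prev  = $CODE{$first} // 0;             # code of the previous letter

    for my $c (@chars) {
        next if $c eq 'H' or $c eq 'W';         # H and W do not separate equal codes
        my $code = $CODE{$c} // 0;              # vowels and Y count as 0
        $out .= $code if $code and $code != $prev;
        $prev = $code;
    }
    return substr($out . '000', 0, 4);          # pad or truncate to four characters
}

# name_key: hypothetical helper, codes each word of a full name separately.
sub name_key {
    my ($full) = @_;
    return join ' ', map { simple_soundex($_) } split ' ', $full;
}

print name_key('Don Banks'), "\n";    # D500 B520
print name_key('Dawn Binx'), "\n";    # D500 B520
```

Both names get the same key, so a Soundex comparison would treat them as the same person. Whether that counts as a useful match or an odd result is exactly the judgment call described above.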
Re^2: Read a column; Compare strings for close (not exact) matches by eyidearie (Novice) on Jul 07, 2005 at 19:03 UTC
Hi, Thanks everyone, I looked at the soundex module, but somewhere it said that it only considers English words and pronunciations... some of the names I have to work with are Indian, Chinese and even African :-( The problem in more detail: I have a group of people calling two different helpdesks. They give their names to be identified. I need to find out which people call both helpdesks, and I can only use their names as identification: they do not give their IDs, and their departments change so often that comparing those would not be useful. But the names can be misspelled by the helpdesk personnel; also, for example, a Robert Carlos could give his name as Bob Carlos sometimes :-( I don't know that I can do much about short forms of names, but I would like different spellings to be identified as the same person as much as possible. Thank you very much, Eyi.
Re^3: Read a column; Compare strings for close (not exact) matches by Transient (Hermit) on Jul 07, 2005 at 19:07 UTC
Lingua::EN::MatchNames would be a start - I'm not sure how much support there is for Indian/Chinese/African names
Re: Read a column; Compare strings for close (not exact) matches by Transient (Hermit) on Jul 07, 2005 at 18:43 UTC
Maybe Text::Soundex could help you out on this one.
Re^2: Read a column; Compare strings for close (not exact) matches by eyidearie (Novice) on Jul 07, 2005 at 19:08 UTC
Thanks Transient. I'm looking at it, but I'm not sure it'd work for non-English names :-( But I'll try and see what results I get.
Re: Read a column; Compare strings for close (not exact) matches by ww (Archbishop) on Jul 07, 2005 at 18:45 UTC
Try some of these: approximate match
Re^2: Read a column; Compare strings for close (not exact) matches by eyidearie (Novice) on Jul 07, 2005 at 19:06 UTC
Thanks ww, this is certainly helpful. Checking it out...
Re: Read a column; Compare strings for close (not exact) matches by Limbic~Region (Chancellor) on Jul 07, 2005 at 19:07 UTC
eyidearie, I already gave some advice over here, which I will repeat here for the benefit of everyone. That is to look at a few modules on CPAN and see if they fit your needs: Win32::OLE, Spreadsheet::ParseExcel, String::Approx, and Text::Levenshtein. If you need more specific help after that, come back and augment your question with more details. Cheers - L~R
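String::Approx and Text::Levenshtein both work in terms of edit distance. As a rough illustration of what that measures, here is a hand-rolled Levenshtein distance in plain Perl, a stand-in for the CPAN modules rather than their actual code. The helper names `edit_distance` and `close_pairs`, the threshold of 2, and the sample names are all invented for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(min);

# edit_distance: hypothetical helper. Counts the fewest single-character
# insertions, deletions and substitutions needed to turn $s into $t,
# ignoring case.
sub edit_distance {
    my ($s, $t) = map { lc $_ } @_;
    my @prev = (0 .. length($t));               # distances from the empty prefix of $s
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,                  # delete from $s
                $cur[$j - 1] + 1,               # insert into $s
                $prev[$j - 1] + $cost,          # substitute, or keep if equal
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# close_pairs: hypothetical helper. Compares every name from one desk with
# every name from the other and keeps pairs within $max edits.
sub close_pairs {
    my ($max, $desk_a, $desk_b) = @_;
    my @pairs;
    for my $a_name (@$desk_a) {
        for my $b_name (@$desk_b) {
            my $d = edit_distance($a_name, $b_name);
            push @pairs, [$a_name, $b_name, $d] if $d <= $max;
        }
    }
    return @pairs;
}

# Made-up sample data standing in for the two helpdesk columns.
my @desk_a = ('Robert Carlos', 'Priya Raman');
my @desk_b = ('Robert Carlos', 'Pria Raman', 'Bob Carlos');
printf "%s ~ %s (distance %d)\n", @$_ for close_pairs(2, \@desk_a, \@desk_b);
```

With these sample lists the misspelling 'Pria Raman' is paired with 'Priya Raman', while 'Bob Carlos' is not paired with 'Robert Carlos': edit distance catches typos but not short forms of names. Comparing every pair also gets slow for long lists, so the threshold and the list sizes both need some care.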
Re^2: Read a column; Compare strings for close (not exact) matches by runrig (Abbot) on Jul 07, 2005 at 20:11 UTC
For "simple" things, I like Spreadsheet::ParseExcel::Simple.
Re: Read a column; Compare strings for close (not exact) matches by friedo (Prior) on Jul 07, 2005 at 19:04 UTC
In addition to the Soundex type stuff, the module Spreadsheet::ParseExcel works great for getting data out of Excel sheets.
Re: Read a column; Compare strings for close (not exact) matches by flogic (Acolyte) on Jul 07, 2005 at 20:38 UTC
Not something I'd recommend. It's what I did before I found out about soundex and metaphone. However it is a cute regex trick.

```perl
sub offby {
    my ($dist, $word, @possibles) = @_;
    # Builds a pattern such as ^(\w*).?(\w*):\1.?\2$ when $dist is 1
    my $re = sprintf(
        q|^%s:%s$|,
        join(q|.?|, map { q|(\w*)| } (0 .. $dist)),
        join(q|.?|, map { "\\" . ($_ + 1) } (0 .. $dist)),
    );
    my @suggestions;
    # print "user $word regexp $re\n";
    foreach (@possibles) {
        my $str = "$word:$_";
        if ($str =~ /$re/) {
            push @suggestions, $_;
        }
    }
    return @suggestions;
}
```
Re: Read a column; Compare strings for close (not exact) matches by TedPride (Priest) on Jul 07, 2005 at 19:52 UTC
Checking for one-letter differences isn't hard. Neither is checking for two letters that have been swapped by mistake. However, matching according to sound differences, especially in several different languages at once, is going to be next to impossible. What you need is a learning system - one that lists all unmatched last names for both desks in alphabetical order, and remembers which name pairs you pick. Once all last names have been matched up or marked as having no match, you can then sort by last names (each last name being followed by its aliases), and go to work on first names. After several months of use, the system should be able to figure out almost all the matches on its own. I assume you have a large sampling of names to work from? Otherwise you could do this by hand, and wouldn't need a script for it.
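The two easy checks mentioned at the start of this reply, a one-letter difference and a swap of two neighbouring letters, could be sketched roughly like this. `near_match` is a hypothetical helper name, and treating a missing or extra letter as a "one-letter difference" is an assumption of this sketch.

```perl
use strict;
use warnings;

# near_match: hypothetical helper. True if the two strings are equal
# (ignoring case), differ by one changed, missing or extra letter, or
# differ by one swap of two neighbouring letters.
sub near_match {
    my ($x, $y) = map { lc $_ } @_;
    return 1 if $x eq $y;
    ($x, $y) = ($y, $x) if length($x) > length($y);   # make $x the shorter one
    my ($lx, $ly) = (length($x), length($y));
    return 0 if $ly - $lx > 1;

    # Skip the common prefix; $i ends up at the first differing position.
    my $i = 0;
    $i++ while $i < $lx && substr($x, $i, 1) eq substr($y, $i, 1);

    if ($lx == $ly) {
        # One changed letter: everything after position $i is identical.
        return 1 if substr($x, $i + 1) eq substr($y, $i + 1);
        # One swap: the two letters at $i are reversed, the rest identical.
        my $swapped = scalar reverse substr($y, $i, 2);
        return (substr($x, $i, 2) eq $swapped
            && substr($x, $i + 2) eq substr($y, $i + 2)) ? 1 : 0;
    }
    # One extra letter in $y at position $i.
    return substr($x, $i) eq substr($y, $i + 1) ? 1 : 0;
}

for my $pair (['Carlos', 'Carlso'], ['Carlos', 'Karlos'],
              ['Carlos', 'Carlo'],  ['Carlos', 'Carlton']) {
    printf "%s / %s: %s\n", @$pair, near_match(@$pair) ? 'close' : 'different';
}
```

Pairs that pass a check like this could be offered as suggestions in the pick list, with the confirmed pairs stored as aliases so they never need to be asked about again.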