eyidearie has asked for the wisdom of the Perl Monks concerning the following question:

--Update-- Hi everyone! I was able to complete my project, thanks to all your help. Hope you all have a nice day! --

Hello everyone, I'm really new to this and trying to complete a project at the same time. I'd really appreciate any granules of Perl wisdom that can be shared with me. I need to:

* Read in the data in a particular column of each sheet of an Excel file
* Search through two lists of names for close matches, not exact matches, e.g. comparing Anne Palemoon with Ann Palemoon should return 'found' or 'true' (or whatever is appropriate)

Thank you!

Replies are listed 'Best First'.
Re: Read a column; Compare strings for close (not exact) matches
by Codon (Friar) on Jul 07, 2005 at 18:49 UTC

    You could try one of the /sounds like/ type of modules from CPAN. Text::Metaphone and Text::Soundex are two that I have used. Both have limitations. It really depends on what you are willing to consider a "close match". Do you want 'Don Banks' to be a close match for 'Dawn Binx'? You may get odd results.

    Alternatively, you will need to better define what constitutes a "close match". Is it something that a human would recognize, but is hard to define in programmatic ways? You may be able to program off-by-one type matches, but doing so could be error-prone and expensive (time-intensive to run).
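One way to make "off-by-one" concrete is an edit distance. The following is a minimal pure-Perl sketch of the Levenshtein distance, not something proposed in this thread; the helper `is_close_match` and its default threshold of one edit are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein edit distance: the number of single-character insertions,
# deletions, or substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: case-insensitive comparison against an assumed
# maximum number of edits (defaults to 1).
sub is_close_match {
    my ($name_a, $name_b, $max_edits) = @_;
    $max_edits = 1 unless defined $max_edits;
    return edit_distance(lc $name_a, lc $name_b) <= $max_edits ? 1 : 0;
}

print is_close_match('Anne Palemoon', 'Ann Palemoon') ? "found\n" : "not found\n";
```

Each comparison costs time proportional to the product of the two string lengths, and comparing every name in one list against every name in the other multiplies that again, which is the expense mentioned above.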

    Any additional info you can supply here would help us better gauge what you are trying to accomplish.

    Ivan Heffner
    Sr. Software Engineer, DAS Lead
    WhitePages.com, Inc.
      Hi, thanks everyone. I looked at the Soundex module, but somewhere it said that it only considers English words and pronunciations... some of the names I have to work with are Indian, Chinese and even African :-(

      The problem in more detail: I have a group of people calling two different helpdesks. They give their names to be identified. I need to find out which people call both helpdesks, and I can only use their names as identification: they do not give their IDs, and their departments change quite often, so those would not be useful to compare. But the names can be misspelled by the helpdesk personnel; also, for example, a Robert Carlos could sometimes give his name as Bob Carlos :-(

      I don't know that I can do much about short forms of names, but I would like different spellings to be identified as the same person as much as possible.

      Thank you very much, Eyi.
        Lingua::EN::MatchNames would be a start - I'm not sure how much support there is for Indian/Chinese/African names.
Re: Read a column; Compare strings for close (not exact) matches
by Transient (Hermit) on Jul 07, 2005 at 18:43 UTC
      Thanks Transient. I'm looking at it, but I'm not sure it'd work for non-English names :-( But I'll try and see what results I get.
Re: Read a column; Compare strings for close (not exact) matches
by ww (Archbishop) on Jul 07, 2005 at 18:45 UTC
      Thanks ww, this is certainly helpful. Checking it out...
Re: Read a column; Compare strings for close (not exact) matches
by Limbic~Region (Chancellor) on Jul 07, 2005 at 19:07 UTC
Re: Read a column; Compare strings for close (not exact) matches
by friedo (Prior) on Jul 07, 2005 at 19:04 UTC
    In addition to the Soundex type stuff, the module Spreadsheet::ParseExcel works great for getting data out of Excel sheets.
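As a rough illustration of reading one column with Spreadsheet::ParseExcel, here is a minimal sketch. The function name `read_column` and the file name in the usage note are hypothetical, and the module is loaded inside the function only so that the snippet compiles on a machine where it is not yet installed. Note that this module reads the older binary .xls format, not .xlsx.

```perl
use strict;
use warnings;

# Collect the non-empty values of one column from every sheet of a workbook.
# $file is the path to an .xls file; $col is a 0-based column index.
sub read_column {
    my ($file, $col) = @_;
    require Spreadsheet::ParseExcel;    # CPAN module, loaded on first call

    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse($file)
        or die "Cannot parse $file: ", $parser->error(), "\n";

    my @values;
    for my $worksheet ($workbook->worksheets()) {
        my ($row_min, $row_max) = $worksheet->row_range();
        for my $row ($row_min .. $row_max) {
            my $cell = $worksheet->get_cell($row, $col);
            next unless $cell;          # skip empty cells
            push @values, $cell->value();
        }
    }
    return @values;
}
```

A call such as `my @names = read_column('helpdesk_a.xls', 0);` would then return the first column of every sheet as one flat list, ready to be compared against the list from the other helpdesk.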
Re: Read a column; Compare strings for close (not exact) matches
by flogic (Acolyte) on Jul 07, 2005 at 20:38 UTC
    Not something I'd recommend. It's what I did before I found out about soundex and metaphone. However it is a cute regex trick.
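    sub offby {
        my($dist,$word,@possibles)=@_;
        # Builds a pattern such as ^(\w*).?(\w*):\1.?\2$ (for $dist == 1), so
        # "$word:$candidate" matches when both share the same \w runs and
        # differ only in up to $dist single-character gaps between them.
        my($re)=sprintf(
            q|^%s:%s$|,
            join(q|.?|,map{q|(\w*)|}(0..$dist)),
            join(q|.?|,map{"\\".($_+1)}(0..$dist))
        );
        my(@suggestions);
        #print "user $word regexp $re\n";
        foreach(@possibles) {
            my($str)="$word:$_";
            if($str=~/$re/) {
                push(@suggestions,$_);
            }
        }
        return(@suggestions)
    }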
Re: Read a column; Compare strings for close (not exact) matches
by TedPride (Priest) on Jul 07, 2005 at 19:52 UTC
    Checking for one-letter differences isn't hard. Neither is checking for two letters that have been swapped by mistake. However, matching according to sound differences, especially in several different languages at once, is going to be next to impossible. What you need is a learning system - one that lists all unmatched last names for both desks in alphabetical order, and remembers which name pairs you pick. Once all last names have been matched up or marked as having no match, you can then sort by last names (each last name being followed by its aliases), and go to work on first names. After several months of use, the system should be able to figure out almost all the matches on its own.

    I assume you have a large sampling of names to work from? Otherwise you could do this by hand, and wouldn't need a script for it.
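The "remembers which name pairs you pick" part of such a learning system could be sketched as a simple alias table. This is only an illustration under assumptions: the names `remember_match`, `lookup_name`, and `triage` are hypothetical, the table lives in memory only, and a real system would need to save it between runs.

```perl
use strict;
use warnings;

# Alias table: lowercased spelling => canonical name picked by the operator.
my %canonical_for;

# Record a pair that the operator confirmed as the same name.
sub remember_match {
    my ($alias, $canonical) = @_;
    $canonical_for{lc $alias}     = $canonical;
    $canonical_for{lc $canonical} = $canonical;
    return;
}

# Return the canonical spelling if this name has been seen, else undef.
sub lookup_name {
    my ($name) = @_;
    return $canonical_for{lc $name};
}

# Split names into those resolved from memory and those still needing a
# human decision; the unmatched ones come back in alphabetical order.
sub triage {
    my @names = @_;
    my (%resolved, @unmatched);
    for my $name (@names) {
        my $known = lookup_name($name);
        if (defined $known) {
            $resolved{$name} = $known;
        }
        else {
            push @unmatched, $name;
        }
    }
    return (\%resolved, [ sort { lc($a) cmp lc($b) } @unmatched ]);
}
```

After `remember_match('Bob Carlos', 'Robert Carlos')`, a later `triage` call resolves 'Bob Carlos' automatically and returns only the names that have never been paired, so the list needing manual review shrinks as the table grows.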