marekjochec has asked for the wisdom of the Perl Monks concerning the following question:

I have a long list of people's first names. Some entries say "Bill" while some say "William", "Jim-James", "Jo-Joseph" etc. Is there a Perl (or SAS) module or function or some list of Regexes that I could use to classify these as one and the same name? Thanks!

Replies are listed 'Best First'.
Re: dealing with colloquial forms of people's first names
by Old_Gray_Bear (Bishop) on Jan 18, 2008 at 22:46 UTC
    Take a look at Lingua::EN::Nickname as a start. It's old (last update in 2003), and it's biased toward English names; but it cites its sources. The URLs have changed (no surprise that), but the Genweb project is still alive and there is a link to a nick-names sub-project.

    ----
    I Go Back to Sleep, Now.

    OGB

Re: dealing with colloquial forms of people's first names
by kyle (Abbot) on Jan 18, 2008 at 21:41 UTC

    I wonder if this is an XY problem. Do you have a list of people, and you think some of them are duplicates (i.e., "James Brown" and "Jim Brown")? In that case, I don't think you can be sure that they're really duplicated and not just two people with similar names (or even identical names). If you don't have some other way to uniquely identify them, you're just guessing.

    Update: For a practical example of this problem in action, see here.

      First of all, thank you for your time! I have a list of names of managers of U.S. mutual funds by fund and year. The problem is that when fund secretaries submitted entries to the database (I imagine), one year they wrote "Jim" Last and the next year "James" Last. Sometimes there is simply a typo: "Jin" Last. (Often the typos are e->c; I would guess that OCR software read the printouts or scans?) It is very unlikely that two fund managers among about 10,000 would have exactly the same name, especially when their last name is not a common one. There are such cases, for example "Jr." and "Sr." in family-managed funds, but I am either aware of them or can tolerate some errors. Typos and nicknames are much more frequent and constitute a much bigger problem. Thanks again.
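For the typo half of the problem, one possible approach is to compare first names only among entries that share a last name, and to flag pairs that differ by a single character. The sketch below is only an illustration under that assumption: `edit_distance` and `likely_typo` are hypothetical helper names, the one-edit threshold is a guess, and each entry is assumed to be a plain "First Last" string.

```perl
use strict;
use warnings;

# Levenshtein distance: the number of single-character insertions,
# deletions or substitutions needed to turn $s into $t.
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            my $best = $prev[ $j - 1 ] + $cost;    # match or substitute
            $best = $prev[$j] + 1      if $prev[$j] + 1 < $best;         # delete
            $best = $cur[ $j - 1 ] + 1 if $cur[ $j - 1 ] + 1 < $best;    # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical rule: same last name, first names at most one edit apart.
# Assumes each entry looks like "First Last"; comparison ignores case.
sub likely_typo {
    my ( $name_a, $name_b ) = @_;
    my ( $first_a, $last_a ) = split ' ', lc($name_a), 2;
    my ( $first_b, $last_b ) = split ' ', lc($name_b), 2;
    return 0 unless defined $last_a && defined $last_b && $last_a eq $last_b;
    return edit_distance( $first_a, $first_b ) <= 1 ? 1 : 0;
}

print likely_typo( 'Jin Last', 'Jim Last' )   ? "flag\n" : "skip\n";
print likely_typo( 'Jim Last', 'James Last' ) ? "flag\n" : "skip\n";
```

A rule like this catches "Jin" for "Jim" but not "Jim" for "James", which are three edits apart, so nicknames still need a lookup table. It would also flag genuinely different short names such as "Jim" and "Tim", so flagged pairs are candidates for review, not confirmed duplicates.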
Re: dealing with colloquial forms of people's first names
by swampyankee (Parson) on Jan 19, 2008 at 15:26 UTC

    First, kudos to kyle for his observation that this isn't going to be a reliable method of checking to see if two names actually point to the same person.

    Second, nicknames are not uniquely associated with a complete given name. Some examples are:

    $nickname{ted} = ['Edward', 'Theodore'];
    $nickname{ed}  = ['Edward', 'Edwin', 'Edgar', 'Edmund'];
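Because one nickname can map to several formal names, a lookup can only say that two first names are possibly the same, never that they definitely are. A minimal sketch of that idea follows; `candidates` and `possibly_same` are hypothetical helper names, and the table holds only the two entries above, whereas a real one would be much larger.

```perl
use strict;
use warnings;

# Hypothetical lookup table: lowercase nickname => possible formal names.
my %nickname = (
    ted => [ 'Edward', 'Theodore' ],
    ed  => [ 'Edward', 'Edwin', 'Edgar', 'Edmund' ],
);

# Return every formal name a first name might stand for.
# A name that is not in the table is assumed to be formal already.
sub candidates {
    my ($name) = @_;
    my $list = $nickname{ lc $name };
    return $list ? @$list : ( ucfirst lc $name );
}

# Count the formal names the two inputs have in common.
# Zero means "cannot be the same"; anything else means "possibly the same".
sub possibly_same {
    my ( $first, $second ) = @_;
    my %seen = map { $_ => 1 } candidates($first);
    my $shared = grep { $seen{$_} } candidates($second);
    return $shared;
}

print possibly_same( 'Ted', 'Ed' )       ? "maybe\n" : "no\n";
print possibly_same( 'Ted', 'Theodore' ) ? "maybe\n" : "no\n";
print possibly_same( 'Ed',  'Theodore' ) ? "maybe\n" : "no\n";
```

"Ted" and "Ed" overlap only through "Edward", so a match here is a hint to check other fields (last name, fund, year), not proof of identity.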

    emc

    Information about American English usage here and here.

    Any Northeastern US area jobs? I'm currently unemployed.

Re: dealing with colloquial forms of people's first names
by apl (Monsignor) on Jan 19, 2008 at 02:44 UTC