efreed has asked for the wisdom of the Perl Monks concerning the following question:

I have a long list (4000+) of doctors that I want to dump into a database, but the list is not consistent in its use of titles; for example, some are MD, some are M.D., etc. The list also includes dentists (DDS or D.D.S), registered nurses (RN), etc. And the form it comes in is like this:

LastName, FirstName title

But some first names contain a space (like Mary Jane Smith, which might be "Smith, Mary Jane MD" or "Smith, Mary Jane M.D.")!

I am stuck on making a regex to pull out the names and titles. I can easily get the last name before the ",", but the first name and title are killing me 'cause there are so many variations of titles and first-name spacing.

Any help?

  • Comment on Matching Doctors (MD, DR, M.D., DO etc) help

Replies are listed 'Best First'.
Re: Matching Doctors (MD, DR, M.D., DO etc) help
by BrowserUk (Patriarch) on Jan 16, 2003 at 17:44 UTC

    From your description and examples, you might be able to simply assume that the last space-delimited "word" is the title? In which case something like

    my ($lastname, $firstnames, $title) = $name =~ /^\s*([^,]+),\s*(.+)\s+(\S+)\s*$/;

    might work for you. (That's untested but should give the general idea).

    If that fails, post a few examples of those that it fails on and someone will probably be able to improve it for you.

    Update: I did a little testing and that regex seems to work provided that the title consists of a single "word", where a word is defined as a string of non-space chars, and that it is the last non-space string on the line. If you need to separate out any middle initials, or if there are some lines without the title present, you'll need to match that word against a list (hash) of possible titles and decide on that basis whether it is the title or part of the first names or initials. If David Robert Smith uses his initials and omits his title, you have a problem:)
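    A minimal sketch of the hash lookup described in the update, for illustration only. The sub name parse_name and the contents of %KNOWN_TITLE are assumptions, not part of the original post; the real list of titles would have to come from the data. The last word is normalised (periods dropped, upper-cased) before the lookup, so "M.D." and "MD" are treated alike.

```perl
use strict;
use warnings;

# Hypothetical list of known titles, stored without periods and in upper case.
my %KNOWN_TITLE = map { $_ => 1 } qw(MD DO DDS RN DR);

# Split "LastName, FirstName(s) [title]" into (last, first names, title).
# Returns an empty list if the line does not have a "Last, First" shape.
sub parse_name {
    my ($line) = @_;
    my ($last, $rest) = $line =~ /^\s*([^,]+?)\s*,\s*(.*?)\s*$/
        or return;

    my @words = split ' ', $rest;
    return unless @words;

    # Normalise the final word before looking it up in the hash.
    (my $key = uc $words[-1]) =~ tr/.//d;

    my $title = '';
    # Only treat the last word as a title if a first name would remain.
    if (@words > 1 && $KNOWN_TITLE{$key}) {
        pop @words;
        $title = $key;
    }
    return ($last, join(' ', @words), $title);
}

print join('|', parse_name('Smith, Mary Jane M.D.')), "\n";   # Smith|Mary Jane|MD
print join('|', parse_name('Smith, John Q')), "\n";           # Smith|John Q|
```

    Because the title is recognised by lookup rather than by position, a trailing middle initial stays with the first names. A first name that happens to spell a title would still be misread, so the list probably needs checking against the real data.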


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      That was it, once I assumed that all titles are at least two characters long, so that a middle initial is not taken as the title for people with no title (Smith, John Q). Many thanks.
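      One way the two-character assumption might be written into the regex, as a sketch only: the sub name split_name is hypothetical, and making the title group optional is an assumption about how the final version looked. The title must be at least two non-space characters, so a bare trailing initial stays with the first names.

```perl
use strict;
use warnings;

# Split "LastName, FirstName(s) [title]", where a title is optional and,
# when present, is the last word and at least two characters long.
sub split_name {
    my ($name) = @_;
    my ($last, $first, $title) =
        $name =~ /^\s*([^,]+),\s*(.+?)(?:\s+(\S{2,}))?\s*$/
        or return;
    return ($last, $first, defined $title ? $title : '');
}

print join('|', split_name('Smith, Mary Jane MD')), "\n";   # Smith|Mary Jane|MD
print join('|', split_name('Smith, John Q')), "\n";         # Smith|John Q|
```

      The assumption has a cost: a two-word first name with no title ("Smith, Mary Jane") would come back with "Jane" as the title, and an initial written with a period ("Q.") counts as two characters. Checking the candidate against a list of known titles avoids both.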
Re: Matching Doctors (MD, DR, M.D., DO etc) help
by vek (Prior) on Jan 16, 2003 at 17:44 UTC
    You might want to give Lingua::EN::NameParse a try. It's been a while since I've used it but I seem to recall it being able to accomplish something similar to what you are attempting with your regex.

    -- vek --