in reply to regex: seperating parts of non-formatted names

She has a database full of names that follow no specific format, that she needs to separate into [title, first name, middle initial, last name]

Welcome to the administrative sub-basement of hell. Depending on the names your friend has to deal with, you might discover that a regex can handle 98%, but that the remaining 2% will cause you to run screaming into the night.

Consider my dear friend

    Lt. Col. J. Random von Perl-Hacker III

By the scheme your friend is using, Randy's name needs to reduce to

    <Lt. Col.> <J.> <R.> <von Perl-Hacker>

(And it isn't immediately clear what to do with the "III".) In any large set of unstructured names, you're going to run into a few like this. Good luck handling them with a single regexp.

I think you'll have better luck breaking the name into tokens, providing predicate functions that answer whether a token can be of a particular type, then providing a set of "rules" to match a set of tokens against. This will be slower, but potentially much more accurate, than a regex.
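
Here is a rough sketch of what I mean, not a finished parser. Everything in it is made up for illustration: the lookup tables, the one-letter type codes, and the names tokenize, type_of and parse_name. Each token is classified by a predicate, the token types are joined into a string, and each "rule" is a pattern over that type string (one character per token), not over the raw name.

```perl
use strict;
use warnings;

# Hypothetical lookup tables; real data will need much longer lists.
my %TITLE    = map { $_ => 1 } qw(mr mrs ms dr prof lt col capt gen sir);
my %PARTICLE = map { $_ => 1 } qw(von van de der la di);
my %SUFFIX   = map { $_ => 1 } qw(jr sr ii iii iv phd md esq);

# Split on whitespace only, so "Perl-Hacker" and "Ph.D." stay whole.
sub tokenize { my ($name) = @_; return split ' ', $name; }

# Normalize a token for table lookups: lowercase, periods removed.
sub norm { my ($tok) = @_; (my $t = lc $tok) =~ tr/.//d; return $t; }

# Predicates: can this token be of the given type?
sub is_title    { $TITLE{ norm($_[0]) }    ? 1 : 0 }
sub is_particle { $PARTICLE{ norm($_[0]) } ? 1 : 0 }
sub is_suffix   { $SUFFIX{ norm($_[0]) }   ? 1 : 0 }
sub is_initial  { $_[0] =~ /^[A-Za-z]\.?$/ ? 1 : 0 }

# One type code per token: T title, S suffix, P particle, I initial, W word.
sub type_of {
    my ($tok) = @_;
    return 'T' if is_title($tok);
    return 'S' if is_suffix($tok);
    return 'P' if is_particle($tok);
    return 'I' if is_initial($tok);
    return 'W';
}

# Rules are tried in order; each capture group fills the slot of the same rank.
my @RULES = (
    [ qr/^(T*)([IW])([IW]*?)(P*[IW])(S*)$/, [qw(title first middle last suffix)] ],
    [ qr/^(T*)(P*W)(S*)$/,                  [qw(title last suffix)] ],
);

sub parse_name {
    my ($name) = @_;
    my @tok   = tokenize($name);
    my $types = join '', map { type_of($_) } @tok;
    for my $rule (@RULES) {
        my ($pattern, $slots) = @$rule;
        next unless $types =~ $pattern;
        my %part;
        for my $i (0 .. $#$slots) {
            # Offsets into the type string are also indexes into @tok.
            my ($from, $to) = ($-[$i + 1], $+[$i + 1]);
            $part{ $slots->[$i] } = join ' ', @tok[ $from .. $to - 1 ];
        }
        return \%part;
    }
    return;    # no rule matched: set the name aside for a human
}
```

With these tables, parse_name("Lt. Col. J. Random von Perl-Hacker III") puts "Lt. Col." in title, "J." in first, "Random" in middle, "von Perl-Hacker" in last and "III" in suffix; cutting the middle name down to an initial is a separate step. Names that match no rule come back empty, which is where the remaining 2% ends up.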

Re: Re: regex: seperating parts of non-formatted names
by jkahn (Friar) on Sep 09, 2002 at 18:48 UTC
    I can't agree enough.
    You will save yourself brain hurt by tokenizing first, so at least you have some idea where the word boundaries are in some reliable way. Then you need the predicate functions, as the previous poster pointed out.
    Perhaps the following snippet makes sense:
    # tokenize(), is_title(), is_name() and is_acceptable_suffix()
    # are left for you to write
    sub parse_names {
        my ($erstwhile_name) = @_;
        my (@titleToks, @nameToks, @suffixes);
        my @uncategorized = tokenize($erstwhile_name);

        # check that tokens remain before looking at the next one
        push @titleToks, shift @uncategorized
            while @uncategorized and is_title($uncategorized[0]);
        if (not @uncategorized) {
            warn "all titles!";
            return;
        }

        push @nameToks, shift @uncategorized
            while @uncategorized and is_name($uncategorized[0]);

        # now probably want to break up @nameToks into first and
        # last names; this probably involves specific lists like
        # "van" and "von" and "de" so you attach "de Sade", "van
        # Gogh" to the last name, but "Robert Louis" to the first
        # name

        if (@uncategorized) {
            # must be suffixes like "III", "Jr.", etc
            @suffixes = @uncategorized;
            if (not is_acceptable_suffix(@suffixes)) {
                warn "problem with suffixes " . join " ", @suffixes;
            }
        }
        return (\@titleToks, \@nameToks, \@suffixes);
    }
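
    The first/last split described in that comment could look roughly like this. It is only a sketch: the particle list and the name split_given_and_surname are invented for the example, and it assumes the last token always belongs to the surname.

```perl
use strict;
use warnings;

# Hypothetical particle list; extend it for the data at hand.
my %IS_PARTICLE = map { $_ => 1 } qw(van von de der den la le di du);

# Returns (\@given, \@surname). The last token always goes to the
# surname; particles directly in front of it are pulled in as well.
sub split_given_and_surname {
    my @toks = @_;
    return ([], []) unless @toks;
    my @surname = (pop @toks);
    while (@toks and $IS_PARTICLE{ lc $toks[-1] }) {
        unshift @surname, pop @toks;
    }
    return (\@toks, \@surname);
}
```

    Called with qw(Vincent van Gogh) it gives "Vincent" and "van Gogh"; with qw(Robert Louis Stevenson) it gives "Robert Louis" and "Stevenson". Someone whose given name happens to be "Van" will still fool it.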
Re: Re: regex: seperating parts of non-formatted names
by sauoq (Abbot) on Sep 09, 2002 at 23:32 UTC
    Consider my dear friend
    Lt. Col. J. Random von Perl-Hacker III
    I know him! He finally got that degree... Now he gives his name as
    Lt. Col. J. Random von Perl-Hacker III Ph.D.

    Good luck!

    . . . I wonder if there is a unicode glyph for "the artist formerly known as Prince" . . .

    -sauoq
    "My two cents aren't worth a dime.";