Hello Monks,

I am trying to write a script that will help out a co-worker who has not yet been enlightened to the powers of Perl. I already managed to impress her when I took 5 minutes to write a script that ran for 30 seconds and saved her at least an hour of work. She has a database full of names that follow no specific format, which she needs to separate into:

1) title
2) first name
3) middle initial
4) last name

Some might have all this information, some might not.

I know that this is feasible with a fairly complex regex, which is where I'm running into some problems. I'm sure I could put something together that would work fairly well, but I want to try to write code that will perform appropriately for all cases.

To show that I'm not just asking you guys to solve my problem, I have come up with some ideas that I think need to be incorporated into the regex.

There are multiple possible titles (e.g. LTC, COL, DR, MS, MR, MISS, etc.). Instead of having a long regex testing LTC|DR|MS|MR, would it be possible to toss them into an array and have a portion of the regex be executed code that iterates through each possibility in the array and returns the match? That way, as new titles come up, they can easily be added.
The different parts of the name are separated mostly by spaces: the middle initial could be grabbed with (\w\.), and the first and last names could be grabbed based on \w versus spaces. Is there a better approach?
Certain names are only last names; there could be a special case for this that would lessen the complexity of the regex.
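
To make the first idea concrete, here is a rough sketch of what I have in mind: the titles live in an array, and the alternation is built from it with join and compiled once with qr//, so no code has to run inside the regex. The title list, the names $alt and $title_re, and the helper leading_title are placeholders of mine, and the sketch assumes a title only ever appears at the very start of a name.

```perl
use strict;
use warnings;

# Hypothetical title list; new titles only need to be added here.
my @titles = qw(LTC COL DR MS MRS MR MISS);

# Escape each entry and join into one alternation, longest first so that
# MISS is tried before MS.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @titles;

# Compiled once: a title at the start of the string, an optional period,
# then whitespace. Case-insensitive so "Dr." and "DR" both match.
my $title_re = qr/^\s*((?:$alt)\.?)(?=\s)/i;

# Returns the leading title, or undef when the name has none.
sub leading_title {
    my ($name) = @_;
    return $name =~ $title_re ? $1 : undef;
}

for my $name ('Dr. James T. Taylor', 'LTC Smith', 'Frederick H. Jones') {
    my $title = leading_title($name);
    print "$name => ", (defined $title ? "<$title>" : 'no title'), "\n";
}
```

The lookahead for whitespace is there so that a first name such as "Missy" is not mistaken for the title MISS.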

Here's an example of what I'm looking for. Say I had the following names:

Frederick H. Jones
Dr. James T. Taylor
Dr. Mat L. R. Michaels

I'd want to be able to separate this into:
(< > marks a chunk tossed into a variable)

<Frederick> <H.> <Jones>
<Dr.> <James> <T.> <Taylor>
<Dr.> <Mat> <L. R.> <Michaels>
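
One alternative to a single big regex, as a minimal sketch: split each name on whitespace and peel the pieces off both ends. The hash %is_title and the function parse_name are hypothetical names of mine, the title list is only a sample, and the sketch assumes the last name is always a single final word, so it would mishandle suffixes like "Jr." or multi-word surnames.

```perl
use strict;
use warnings;

# Hypothetical title list, keyed in upper case without the trailing period.
my %is_title = map { $_ => 1 } qw(LTC COL DR MS MRS MR MISS);

# Split a name into (title, first, middle, last); missing parts come back
# as empty strings.
sub parse_name {
    my ($name) = @_;
    my @parts = split ' ', $name;
    my ($title, $first, $middle, $last) = ('', '', '', '');

    # A leading token is a title if it is in the list once the period is dropped.
    if (@parts) {
        (my $key = uc $parts[0]) =~ s/\.$//;
        $title = shift @parts if $is_title{$key};
    }

    $last   = pop @parts   if @parts;   # a lone remaining token is a last name
    $first  = shift @parts if @parts;
    $middle = join ' ', @parts;         # whatever is left, e.g. "L. R."

    return ($title, $first, $middle, $last);
}

for my $name ('Frederick H. Jones', 'Dr. James T. Taylor', 'Dr. Mat L. R. Michaels') {
    print join(' ', map { "<$_>" } grep { length } parse_name($name)), "\n";
}
```

Because the last name is taken first, a record holding only a last name falls out naturally without a special case.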

I'm going to start working on this regex and toy around with different ideas. I'll post what I have completed every so often, but any feedback, ideas, suggestions, or code would be appreciated.

Thanks in advance,
Eric

In reply to regex: seperating parts of non-formatted names by emilford
