in reply to regex: seperating parts of non-formatted names

As dws pointed out, this is a deceptively tricky problem to solve reliably. I did some consultancy work for a company trying to clean up the names and addresses in their customer database (6M customers, as I recall) to save money on their bulk postings. The development team assigned to the task had spent two years without producing an acceptable solution, and the vendor solutions I was evaluating were in the region of GBP250K for the licensing costs alone. Each vendor solution also required additional work to "tailor" it to local needs, and even then there was about a 1% "failure to parse" rate.

The general method of processing started by tokenizing the input data, then ranking the tokens by frequency and placement, and finally iterating over the result to move each token to the correct list (titles, first name, initials, surname, qualifications, etc.).
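To make that outline concrete, here is a very rough sketch in Python. It is not what any of the vendors actually did: the TITLES and QUALIFICATIONS tables and the function names are made up for illustration, and fixed lookup sets stand in for real frequency ranking, which would be derived from the customer data itself. It only shows the tokenize, classify by placement, then move-to-list shape of the approach.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only).
# A real system would rank tokens by frequency counts taken from the data.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir"}
QUALIFICATIONS = {"phd", "msc", "bsc", "ba", "md", "obe"}


def tokenize(raw):
    """Split on whitespace and commas, stripping leading/trailing full stops."""
    pieces = re.split(r"[\s,]+", raw.strip())
    return [p.strip(".") for p in pieces if p.strip(".")]


def classify(raw):
    """Move each token of a free-form name into one of five lists."""
    tokens = tokenize(raw)
    parts = {
        "titles": [],
        "firstnames": [],
        "initials": [],
        "surname": [],
        "qualifications": [],
    }

    # Placement rule 1: known titles are peeled off the front.
    while tokens and tokens[0].lower() in TITLES:
        parts["titles"].append(tokens.pop(0))

    # Placement rule 2: known qualifications are peeled off the back,
    # always leaving at least one token to act as the surname.
    while len(tokens) > 1 and tokens[-1].lower() in QUALIFICATIONS:
        parts["qualifications"].insert(0, tokens.pop())

    # Placement rule 3: the last remaining token is taken as the surname.
    if tokens:
        parts["surname"].append(tokens.pop())

    # Whatever is left is an initial (single character) or a first name.
    for tok in tokens:
        key = "initials" if len(tok) == 1 else "firstnames"
        parts[key].append(tok)

    return parts


if __name__ == "__main__":
    print(classify("Dr. John A. Smith, PhD"))
```

Even this toy version shows where the trouble starts: "Smith, John" style input, double-barrelled or multi-word surnames, and tokens that can be either a first name or a surname all break the simple placement rules, which is presumably where the ranking and the residual 1% failure rate come in.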

rdfield
