Nothing I know of, but the real problem would be the list of names. Which country? What about nicknames and different spellings?
There are alias files on the internet that list common names and alternative spellings, so track down one that fits your filenames.
Then it's just a simple match, so not really something for a module to do.
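To show what that simple match might look like, here is a minimal sketch in Python (the same few lines port directly to Perl). The alias-file layout is an assumption: one canonical name per line, then a colon and comma-separated alternative spellings. Real alias files will differ, and `load_aliases` and `canonical_name` are made-up helper names.

```python
import io

def load_aliases(lines):
    """Build a lookup from every spelling (lowercased) to its canonical name.

    Assumed (hypothetical) line format: "Canonical: alt1, alt2, ..."
    """
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        canonical, _, alts = line.partition(":")
        canonical = canonical.strip()
        lookup[canonical.lower()] = canonical
        for alt in alts.split(","):
            alt = alt.strip()
            if alt:
                lookup[alt.lower()] = canonical
    return lookup

def canonical_name(token, lookup):
    """Return the canonical name for token, or None if it is not listed."""
    return lookup.get(token.strip().lower())

# Tiny made-up sample standing in for a real alias file.
SAMPLE = io.StringIO("""\
# canonical: alternatives
William: Bill, Will, Billy
Ann: Anne, Annie
""")

aliases = load_aliases(SAMPLE)
print(canonical_name("bill", aliases))    # William
print(canonical_name("Zaphod", aliases))  # None
```

The lookup is case-insensitive because file names rarely preserve capitalization, which is also why short names such as "Will" will match a lot of ordinary words.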
What is the problem you're trying to solve? Surnames, first names? Names like Jo, Ian, Ann are really going to cause a lot of problems. | [reply] |
This is something I have needed in more than one project: does a string look like a name?
That by itself means it belongs in a module.
It doesn't mean it deserves to be given a swanky namespace and put on CPAN, but it does mean it would make my work a little bit more convenient.
The fact that I, for one, searched CPAN for it means that somebody would make use of this.
When you get down to it, every module on CPAN is just a simple match. It just happens to be a simple match that I don't have to write. :-)
Did you actually look at Data::RandomPerson? Look at the source code of one of these modules.
This proves that the module I am searching for would have been the basis for Data::RandomPerson. Obviously Peter Hickman (bless his heart!) did not foresee the data being useful for anything else, and thus hardcoded it outside of a named exportable symbol.
| [reply] |
Not to mention O'Reilly, chromatic, Grytpype-Thynne or Cowboy Neal.
| [reply] |
Unless you live in a place like Spain or Italy where strange names are illegal, names are arbitrary (or at least as arbitrary as a given county clerk and judge are in the mood to tolerate). For example, O(+>. So this is not a simple problem but one whose difficulty approaches the impossible.
Something like an interface to a flat census DB distributed with a module might be nice though. It would be useful in many cases. Maybe you could take a stab at it?
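As a rough idea of what such an interface could offer, here is a minimal sketch in Python. Everything about the data is an assumption: the flat file is taken to hold whitespace-separated rows of a name and its frequency in percent, the sample rows are invented placeholders rather than real census figures, and `load_frequencies` and `name_frequency` are hypothetical names.

```python
def load_frequencies(lines):
    """Parse rows of "NAME FREQUENCY" into a dict keyed by lowercased name.

    The two-column layout is an assumption; extra columns are ignored.
    """
    freqs = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # not a data row
        try:
            freqs[fields[0].lower()] = float(fields[1])
        except ValueError:
            continue  # header or malformed row
    return freqs

def name_frequency(token, freqs):
    """Return the recorded frequency for token, or 0.0 if it is unknown."""
    return freqs.get(token.lower(), 0.0)

# Invented placeholder rows, not real census data.
SAMPLE_ROWS = [
    "NAME FREQ",
    "SMITH 1.0",
    "BROWN 0.5",
    "BRENNER 0.01",
]

freqs = load_frequencies(SAMPLE_ROWS)
for candidate in ("Brown", "brenner", "xyzzy"):
    print(candidate, name_frequency(candidate, freqs))
```

Returning a frequency rather than a yes/no answer leaves the threshold to the caller, which seems more honest given how arbitrary names are.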
(Update: and speaking as someone with a mildly-unusual-but-entirely-within-traditional-rules name which is nearly always mangled in subscriptions, billing statements, etc., I have to say that the automatic solutions only end up annoying the customer.)
| [reply] |
Do any of the following strings look like names to you? When you ignore capitalization (which you would tend to do when looking at file names), do any of them look like they might not be names?
Black, White, Green, Brown, May, Will, Mill, Hill, Bill, Hall, Wall, Fields, Woods, Forest, Frank, Earnest, Jewel, Ruby, Gold, Bond, Chuck, Pat, Sales, Miles, Mark, Irons, Steel, Rod, Reed, Robin, Lark, Singer, Ash, Birch, Lily, Iris, Rose, Drew, Dawn, Eve, Spring, Winter, Summer, Autumn, Laurel, Hardy, Abbot, Burns, Grace, Dolly.
Maybe what you really want is to run your file names through a set of procedures that will:
- list possible segmentations of a file name into two or more English words (Lingua::EN::Splitter might help with a limited set of cases)
- assign possible Part-Of-Speech tags for the whole file name and for each word in each possible segmented sequence (cf. Lingua::EN::Tagger)
I don't know about that "Tagger" module, but most English POS taggers will provide the label "proper noun" where appropriate -- given that "appropriate" is based on a statistical likelihood. A really good tagger would return an N-best list rather than just a single "most likely" answer. (Do check out available resources beyond CPAN for POS taggers.)
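The segmentation step in the list above can be sketched by brute force. This is a minimal Python illustration, not how Lingua::EN::Splitter works internally; the function name `segmentations` and the tiny vocabulary are assumptions made for the example, and a real run would need a full word list.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into a sequence of vocabulary words."""
    text = text.lower()
    results = []

    def extend(start, words):
        # Reached the end of the string: the words collected so far cover it.
        if start == len(text):
            results.append(list(words))
            return
        # Try every vocabulary word that begins at position `start`.
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocabulary:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results

# Made-up vocabulary, just large enough to show the ambiguity.
VOCAB = {"rose", "mary", "rosemary", "hill"}

print(segmentations("RosemaryHill", VOCAB))
# [['rose', 'mary', 'hill'], ['rosemary', 'hill']]
```

Even this toy case yields two readings, and each word in each reading would still need a part-of-speech guess before you could call any of it a name.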
Or maybe you want to try Lingua::EN::NamedEntity?
What you want to do with all that possible/probable information is another question. The point is, you are looking at a very complicated problem that often poses a challenge to native speakers (who are much better at it than Perl scripts, but even so cannot be perfect). You could probably code something quickly that handles a majority of cases correctly (e.g. 65% or so), but getting beyond that range would take considerably longer.
(Disclaimer: I haven't personally used any of the modules cited above. Some or all of them might be totally unsuitable for your task.) | [reply] |
Well, this CPAN module might be part of your solution:
Lingua::EN::MatchNames
But as other people have pointed out, you're going to have to deal with a substantial amount of ambiguity about what a valid human name looks like.
If you do some web searches, you can find on-line white pages like this one, for example: Maryland Government Employees. You could use LWP to feed candidate strings into its search form, and if a string returns some hits, you know it's not too unusual as a human name.
Using this algorithm, I can confirm that "Brenner" is a valid name, which is a relief.
| [reply] |