leocharre has asked for the wisdom of the Perl Monks concerning the following question:

I have a large set of filenames that may or may not contain people's names (as part of the string). I've been using regexes and some logic to determine this.

There has to be something on CPAN that I can use to test whether a string is a name or has a name in it.

But I can't find it. The closest I've found is the data in Data::RandomPerson, which has some census data for pulling a random name.

I want to be able to do something like..

use Lingua::EN::Names 'is_name'; # ??
my $string = '123-James-Rubyn2134_docs.pdf';
my @names = grep { is_name( lc($_) ) } split( /\W+/, $string );
print "Names : @names\n";

Now, I could have coded this in the time it took to look it up, fail, and ask. But is something like this out there? Something like a census name list that can easily be queried? I have a hard time believing it doesn't exist; more likely my search skills are what's left wanting.

update

Lingua::Names. It's a hack, but it has a test suite and, dammit, it works. It does what I needed.

Re: missing simple 'is this a name' module on cpan?
by u671296 (Sexton) on Feb 18, 2009 at 15:10 UTC
    Nothing I know of, but the real problem would be the list of names. Which country? What about nicknames and different spellings?
    There are Alias files on the internet which list common names and alternative spellings, so track down one which fits your filenames.

    Then it's just a simple match, so not really something for a module to do.
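
A minimal sketch of what such a "simple match" might look like, assuming a name list has already been found. The six-entry list and the helper name names_in are hypothetical, chosen only for illustration. It splits on runs of non-letters rather than \W+, because digits and underscores count as word characters and would otherwise leave a token like "Rubyn2134_docs" intact.

```perl
use strict;
use warnings;

# Hypothetical name list; in practice this would be loaded from whichever
# alias or census file fits the filenames at hand.
my %is_name = map { $_ => 1 } qw(james mary john ann ian jo);

# Return the tokens of $string that appear in the name list.
# Splitting on runs of non-letters also breaks on digits and underscores,
# which are common in filenames.
sub names_in {
    my ($string) = @_;
    return grep { $is_name{ lc $_ } } split /[^A-Za-z]+/, $string;
}

my @names = names_in('123-James-Rubyn2134_docs.pdf');
print "Names : @names\n";
```

With this sample list only "James" is reported; "Rubyn" is isolated as a token but is not in the list, which is exactly where the choice of name list starts to matter.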

    What is the problem you're trying to solve? Surnames, first names? Names like Jo, Ian, Ann are really going to cause a lot of problems.

      This is something I have needed in more than one project: does a string look like a name?

      That by itself means module.

      It doesn't mean it deserves to be given a swanky namespace and put on CPAN, but it does mean it would make my work a tiny bit more convenient.

      The fact that even I alone looked for it on CPAN means that somebody would make use of this.

      When you get down to it, every module on cpan is just a simple match. It just happens to be a simple match that I don't have to write. :-)

      Did you actually look at Data::RandomPerson? Look at the source code of one of these modules.

      This proves that the module I am searching for would have been the base for Data::RandomPerson. Obviously Peter Hickman (bless his heart!) did not foresee the data being useful for anything else, and thus hardcoded it outside of a named exportable symbol.

      Not to mention O'Reilly, chromatic, Grytpype-Thynne or Cowboy Neal.
Re: missing simple 'is this a name' module on cpan?
by Your Mother (Archbishop) on Feb 18, 2009 at 17:08 UTC

    Unless you live in a place like Spain or Italy where strange names are illegal, names are arbitrary (or at least as arbitrary as a given county clerk and judge are in the mood to tolerate). For example, O(+>. So this is not a simple problem but one on a function approaching impossible.

    Something like an interface to a flat census DB distributed with a module might be nice though. It would be useful in many cases. Maybe you could give it a stab?
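
One way such an interface could look, as a rough sketch only. The "NAME percent" column layout, the sample figures, and the function names load_names and is_name are assumptions made for illustration; a real census file would need its own parsing, and the data here is made up.

```perl
use strict;
use warnings;

# Parse whitespace-separated "NAME percent" records into a hash ref.
# The column layout is an illustrative assumption.
sub load_names {
    my ($fh) = @_;
    my %freq;
    while ( my $line = <$fh> ) {
        next unless $line =~ /\S/;     # skip blank lines
        next if $line =~ /^\s*#/;      # skip comment lines
        my ( $name, $pct ) = split ' ', $line;
        $freq{ lc $name } = $pct;
    }
    return \%freq;
}

# Return 1 if $word is in the table at or above an optional
# frequency threshold, 0 otherwise.
sub is_name {
    my ( $freq, $word, $min_pct ) = @_;
    $min_pct = 0 unless defined $min_pct;
    my $pct = $freq->{ lc $word };
    return ( defined $pct && $pct >= $min_pct ) ? 1 : 0;
}

# Made-up sample data standing in for a flat file shipped with a module.
my $sample = <<'END';
# name  percent
JAMES   3.3
MARY    2.6
RUBY    0.1
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $freq = load_names($fh);
close $fh;

for my $case ( ['James'], [ 'Ruby', 1.0 ], ['Xyzzy'] ) {
    my ( $word, $min ) = @$case;
    my $verdict = is_name( $freq, $word, $min ) ? 'yes' : 'no';
    print "$word: $verdict\n";
}
```

The optional threshold is there because a frequency table lets the caller trade recall for precision: rare names can be ignored when false positives are more costly than misses.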

    (Update: and speaking as someone with a mildly-unusual-but-entirely-within-traditional-rules name which is nearly always mangled in subscriptions, billing statements, etc, I have to say that the automatic solutions only end up being annoying to the customer).

Re: missing simple 'is this a name' module on cpan?
by graff (Chancellor) on Feb 19, 2009 at 03:56 UTC
    Do any of the following strings look like names to you? When you ignore capitalization (which you would tend to do when looking at file names), do any of them look like they might not be names?

    Black, White, Green, Brown, May, Will, Mill, Hill, Bill, Hall, Wall, Fields, Woods, Forest, Frank, Earnest, Jewel, Ruby, Gold, Bond, Chuck, Pat, Sales, Miles, Mark, Irons, Steel, Rod, Reed, Robin, Lark, Singer, Ash, Birch, Lily, Iris, Rose, Drew, Dawn, Eve, Spring, Winter, Summer, Autumn, Laurel, Hardy, Abbot, Burns, Grace, Dolly.
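
The ambiguity can be made concrete with a small sketch. Both lists below are tiny hypothetical samples, and the function name classify is invented for illustration; the point is only that, once case is ignored, a token found in a name list may also be an ordinary word, and a lookup alone cannot tell which.

```perl
use strict;
use warnings;

# Both lists are tiny hypothetical samples, lower-cased as they would be
# after normalizing a filename.
my %name = map { $_ => 1 } qw(mark rose will james ruby frank);
my %word = map { $_ => 1 } qw(mark rose will sales docs report);

# Classify a token as 'name', 'ambiguous', or 'other'.
sub classify {
    my ($token) = @_;
    my $t = lc $token;
    return 'other' unless $name{$t};
    return $word{$t} ? 'ambiguous' : 'name';
}

for my $token ( split /[^A-Za-z]+/, 'Will_Rose-sales_James.pdf' ) {
    next unless length $token;    # skip empty fields from leading separators
    printf "%-6s %s\n", $token, classify($token);
}
```

Flagging the overlap rather than guessing at least makes the uncertain cases visible for whatever step comes next.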

    Maybe what you really want is to run your file names through a set of procedures that will:

    • list possible segmentations of a file name into two or more English words (Lingua::EN::Splitter might help with a limited set of cases)
    • assign possible Part-Of-Speech tags for the whole file name and for each word in each possible segmented sequence (cf. Lingua::EN::Tagger)

    I don't know about that "Tagger" module, but most English POS taggers will provide the label "proper noun" where appropriate -- given that "appropriate" is based on a statistical likelihood. A really good tagger would return an N-best list rather than just a single "most likely" answer. (Do check out available resources beyond CPAN for POS taggers.)

    Or maybe you want to try Lingua::EN::NamedEntity?

    What you want to do with all that possible/probable information is another question. The point is, you are looking at a very complicated problem that often poses a challenge to native speakers (who are much better at it than perl scripts, but even so cannot be perfect). You probably could have coded something quickly that might have handled some majority percentage of cases correctly (e.g. 65% or so), but getting beyond that range would take considerably longer.

    (Disclaimer: I haven't personally used any of the modules cited above. Some or all of them might be totally unsuitable for your task.)

Re: missing simple 'is this a name' module on cpan?
by doom (Deacon) on Feb 18, 2009 at 21:26 UTC

    Well, this CPAN module might be part of your solution: Lingua::EN::MatchNames

    But as other people have pointed out, you're going to have to deal with a substantial amount of ambiguity about what a valid human name looks like.

    If you do some web searches, you can find some on-line white pages like this one, for example: Maryland Government Employees. You could use LWP to feed candidate strings into this form, and if it returns some hits, you know it's not too unusual as a human name.

    Using this algorithm, I can confirm that "Brenner" is a valid name, which is a relief.