in reply to missing simple 'is this a name' module on cpan?

Do any of the following strings look like names to you? When you ignore capitalization (which you would tend to do when looking at file names), do any of them look like they might not be names?

Black, White, Green, Brown, May, Will, Mill, Hill, Bill, Hall, Wall, Fields, Woods, Forest, Frank, Earnest, Jewel, Ruby, Gold, Bond, Chuck, Pat, Sales, Miles, Mark, Irons, Steel, Rod, Reed, Robin, Lark, Singer, Ash, Birch, Lily, Iris, Rose, Drew, Dawn, Eve, Spring, Winter, Summer, Autumn, Laurel, Hardy, Abbot, Burns, Grace, Dolly.

Maybe what you really want is to run your file names through a set of procedures that will:

I don't know about that "Tagger" module, but most English POS taggers will provide the label "proper noun" where appropriate -- given that "appropriate" is based on a statistical likelihood. A really good tagger would return an N-best list rather than just a single "most likely" answer. (Do check out available resources beyond CPAN for POS taggers.)

Or maybe you want to try Lingua::EN::NamedEntity?

What you want to do with all that possible/probable information is another question. The point is, you are looking at a very complicated problem that often poses a challenge to native speakers (who are much better at it than Perl scripts, but even so cannot be perfect). You probably could have coded something quickly that handled a majority of cases correctly (say, 65% or so), but getting beyond that range would take considerably longer.
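To make the "quick but limited" approach concrete, here is a minimal sketch of a word-list lookup that flags tokens which are both plausible names and ordinary words. Everything in it is an assumption for illustration: the two tiny word lists, the classify_token helper, and the sample file names are all made up, and a real list would need thousands of entries.

```perl
use strict;
use warnings;

# Hypothetical word lists, for illustration only; real ones would be far larger.
my %known_names  = map { $_ => 1 } qw(frank rose mark bill dawn alice);
my %common_words = map { $_ => 1 } qw(frank rose mark bill dawn sales report);

# Classify one token as 'name', 'ambiguous', or 'unknown', ignoring case.
sub classify_token {
    my ($token) = @_;
    my $key = lc $token;
    return 'unknown'   unless $known_names{$key};
    return 'ambiguous' if $common_words{$key};   # also an ordinary word
    return 'name';
}

# Made-up file names: strip the extension, split on non-letters, classify.
for my $file (qw(Rose_Sales_Report.txt alice_notes.txt)) {
    (my $stem = $file) =~ s/\.[^.]+\z//;
    for my $token (grep { length } split /[^A-Za-z]+/, $stem) {
        printf "%s: %s => %s\n", $file, $token, classify_token($token);
    }
}
```

A lookup like this can only say "ambiguous" for most of the strings listed earlier; deciding which reading is right needs context, which is where the taggers come in.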

(Disclaimer: I haven't personally used any of the modules cited above. Some or all of them might be totally unsuitable for your task.)
