comment on

A lot of background info here, the real question is, what do people think of the algorithm i'm about to describe, (including the hash of arrayrefs).
I'm working on a name parser, where the name information is really ownership information. This is not very clean, in the sense that "people names" and other name-related info, such as business names, and ownership information are mixed together. For simplicity, lets just say im trying to identify if a string is "people name(s)" or other info
So, i have a list of keywords that deonte non-people names, such as "Associates, Corp, Trust, Rental", etc. Now, the problem is that the name info im trying to parse is pretty ugly, in that there are often (non-standard) abbreviations, particularly in the business words. So, I've tried Lingua::Stem, with limited success, as well as some home-cooked algorithms. I want to try applying some Levenshtein like logic also, but this needs to be fast, as i process millions of names, and what is being described here is only a (logically) small part of the process.
So, i'm thinking of storing my keyword list as follows. Since each keyword will be at least 3 chars, a hash of arrayrefs seems good, where the top-level keys are the first 2 letters of each keyword, and the arrayref associated contains the remaining letters of the words. for example:

my %list = (
   "ass"   =>   ["ociation", "iates", "embly"],
   "res"   =>   ["ource", "tatement", "taurant"],
   etc
);
[download]

With this, i can substring the beginning of each token in the name string, and use %list to find possible keywords matches, then use a Levenshtein distance measure (and other algorithms) to determine the likelihood of what the information is supposed to represent.

A cool thing that hashes do not do, would be to allow a regex for a key. This way, one wouldnt have to cycle through the keys, and apply the regex to each. The Perl internal hashing algorithm could efficiently apply the regexing, so we could do something like:

my @matches = $hash_var{/^res/};
[download]

In reply to text string approxiamtions (concept for review) by shemp

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.