in reply to Re: parsing a very large array with regexps

Sorry, but I'm still confused by your description (and I work in this field, so I should be able to understand). The 274947 items are "words" of a language (e.g. drawn from some collection of text data), the user input is one or more "patterns" (e.g. a string or regex representing a stem or affix morpheme), and the task is to find and list the words that match the patterns. Do I have that right?

But then you talk about a word like "shivâshvah" having 255 possible divisions into morphemes. Obviously, most of these possibilities would be ridiculous as potential linguistic analyses. I don't understand how this sort of arithmetic is relevant to the task.

If the ultimate goal is a query engine that allows a user to specify some sort of citation form for a stem or affix morpheme, and returns the list of vocabulary items that contain that morpheme, then you need a database in which the vocabulary items are indexed according to the morphemes that make up each word.

That is, the morphological analysis of each word in the vocabulary needs to be done "offline" -- once, as a separate project, in advance of actually starting up the query engine (and probably with some amount of manual effort to handle the many irregular forms) -- and the results of that analysis must be stored in a database in such a way that a query on a given morpheme will quickly produce the list of word forms known to contain that morpheme.

This would be a job for a relational database, to handle the many-to-many relations among morphemes and word forms.
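To make the many-to-many idea concrete, here is a minimal sketch in Python using the standard `sqlite3` module. Everything in it is an assumption for illustration: the table names (`word`, `morpheme`, `word_morpheme`), the helper functions `build_index` and `words_containing`, and the toy English analyses standing in for your real, separately produced morphological analysis. It is not a claim about how your data is actually organized.

```python
import sqlite3

# Hypothetical, hand-made analyses standing in for the offline annotation step.
SAMPLE_ANALYSES = {
    "unhappiness": ["un-", "happy", "-ness"],
    "happily": ["happy", "-ly"],
    "unkind": ["un-", "kind"],
}

def build_index(analyses):
    """Load word -> morpheme analyses into an in-memory relational index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE word (id INTEGER PRIMARY KEY, form TEXT UNIQUE NOT NULL);
        CREATE TABLE morpheme (id INTEGER PRIMARY KEY, citation TEXT UNIQUE NOT NULL);
        -- Link table: one row per morpheme occurrence in a word.
        CREATE TABLE word_morpheme (
            word_id INTEGER NOT NULL REFERENCES word(id),
            morpheme_id INTEGER NOT NULL REFERENCES morpheme(id),
            position INTEGER NOT NULL,
            PRIMARY KEY (word_id, position)
        );
        -- Index so a lookup by morpheme does not scan the whole link table.
        CREATE INDEX wm_by_morpheme ON word_morpheme (morpheme_id);
    """)
    for form, morphemes in analyses.items():
        conn.execute("INSERT OR IGNORE INTO word (form) VALUES (?)", (form,))
        word_id = conn.execute(
            "SELECT id FROM word WHERE form = ?", (form,)
        ).fetchone()[0]
        for position, citation in enumerate(morphemes):
            conn.execute(
                "INSERT OR IGNORE INTO morpheme (citation) VALUES (?)", (citation,)
            )
            morpheme_id = conn.execute(
                "SELECT id FROM morpheme WHERE citation = ?", (citation,)
            ).fetchone()[0]
            conn.execute(
                "INSERT INTO word_morpheme (word_id, morpheme_id, position) "
                "VALUES (?, ?, ?)",
                (word_id, morpheme_id, position),
            )
    conn.commit()
    return conn

def words_containing(conn, citation):
    """Return the sorted word forms whose stored analysis includes the morpheme."""
    rows = conn.execute(
        """
        SELECT DISTINCT w.form
        FROM word AS w
        JOIN word_morpheme AS wm ON wm.word_id = w.id
        JOIN morpheme AS m ON m.id = wm.morpheme_id
        WHERE m.citation = ?
        ORDER BY w.form
        """,
        (citation,),
    )
    return [row[0] for row in rows]
```

With the toy data, `words_containing(build_index(SAMPLE_ANALYSES), "happy")` looks up the words indexed under that morpheme without running a single regex over the vocabulary at query time. The point is that the expensive part (deciding which morphemes each word contains) happens once, when the table is loaded.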

If you are instead looking for ways to do the prerequisite analysis of all the vocabulary items, that is a quite different problem from the one you seem to be describing. Automating it involves machine-learning techniques that are still highly speculative (and measurably inaccurate), so manual annotation by linguistically trained speakers of the language is still a necessity.
