Sorry, but I'm still confused by your description (and I work in this field, so I should be able to understand). The 274947 items are "words" of a language (e.g. drawn from some collection of text data), the user input is one or more "patterns" (e.g. a string or regex representing a stem or affix morpheme), and the task is to find and list the words that match the patterns. Do I have that right?

But then you talk about a word like "shivâshvah" having 255 possible divisions into morphemes. Obviously, most of these possibilities would be ridiculous as potential linguistic analyses. I don't understand how this sort of arithmetic is relevant to the task.

If the ultimate goal is a query engine that allows a user to specify some sort of citation form for a stem or affix morpheme, and returns the list of vocabulary items that contain that morpheme, then you need a database in which the vocabulary items are indexed according to the morphemes that make up each word.

That is, the morphological analysis of each word in the vocabulary needs to be done "offline" -- once, as a separate project, in advance of actually starting up the query engine (and probably with some amount of manual effort to handle the many irregular forms) -- and the results of that analysis must be stored in a database in such a way that a query on a given morpheme will quickly produce the list of word forms known to contain that morpheme.
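To make the "offline analysis, then fast lookup" idea concrete, here is a minimal sketch of such an index held in memory. Everything in it is an assumption for illustration: the `ANALYSES` table stands in for the output of the offline analysis, the English sample words and the hyphenated citation forms are made up for the example, and `build_index` and `words_with` are hypothetical names, not part of any existing module.

```python
from collections import defaultdict

# Hypothetical result of the offline analysis: word form -> its morphemes.
# In practice this would be loaded from the stored analysis, not hard-coded.
ANALYSES = {
    "unhappiness": ["un-", "happy", "-ness"],
    "happily": ["happy", "-ly"],
    "unkind": ["un-", "kind"],
    "kindness": ["kind", "-ness"],
}


def build_index(analyses):
    """Invert the analysis: morpheme -> set of word forms containing it."""
    index = defaultdict(set)
    for word, morphemes in analyses.items():
        for morpheme in morphemes:
            index[morpheme].add(word)
    return index


def words_with(index, *morphemes):
    """Return the sorted word forms that contain every requested morpheme."""
    if not morphemes:
        return []
    # .get() avoids creating empty entries for unknown morphemes.
    candidate_sets = [index.get(m, set()) for m in morphemes]
    return sorted(set.intersection(*candidate_sets))


INDEX = build_index(ANALYSES)
print(words_with(INDEX, "happy"))
print(words_with(INDEX, "un-", "-ness"))
```

The point of the sketch is that a query costs one lookup per morpheme plus a set intersection, with no regex scan over the full vocabulary at query time.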

This would be a job for a relational database, to handle the many-to-many relations among morphemes and word forms.
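As a rough sketch of what that could look like, the following uses SQLite with one table for word forms, one for morphemes, and a link table for the many-to-many relation. The table and column names, the `make_db` and `lookup` helpers, and the English sample data are all assumptions made up for this example; a real schema would probably also record things like morpheme type and allomorphs.

```python
import sqlite3

# Hypothetical sample of the offline analysis: word form -> its morphemes.
SAMPLE = {
    "unhappiness": ["un-", "happy", "-ness"],
    "happily": ["happy", "-ly"],
    "kindness": ["kind", "-ness"],
}

SCHEMA = """
CREATE TABLE word     (id INTEGER PRIMARY KEY, form TEXT UNIQUE NOT NULL);
CREATE TABLE morpheme (id INTEGER PRIMARY KEY, citation TEXT UNIQUE NOT NULL);
-- Link table: one row per morpheme occurrence within a word.
CREATE TABLE word_morpheme (
    word_id     INTEGER NOT NULL REFERENCES word(id),
    morpheme_id INTEGER NOT NULL REFERENCES morpheme(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (word_id, position)
);
CREATE INDEX wm_by_morpheme ON word_morpheme (morpheme_id);
"""


def make_db(analyses):
    """Load the analyses into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    for form, morphemes in analyses.items():
        word_id = db.execute(
            "INSERT INTO word (form) VALUES (?)", (form,)
        ).lastrowid
        for position, citation in enumerate(morphemes):
            db.execute(
                "INSERT OR IGNORE INTO morpheme (citation) VALUES (?)",
                (citation,),
            )
            (morpheme_id,) = db.execute(
                "SELECT id FROM morpheme WHERE citation = ?", (citation,)
            ).fetchone()
            db.execute(
                "INSERT INTO word_morpheme VALUES (?, ?, ?)",
                (word_id, morpheme_id, position),
            )
    db.commit()
    return db


def lookup(db, citation):
    """Return the sorted word forms that contain the given morpheme."""
    rows = db.execute(
        """SELECT DISTINCT w.form
             FROM word w
             JOIN word_morpheme wm ON wm.word_id = w.id
             JOIN morpheme m ON m.id = wm.morpheme_id
            WHERE m.citation = ?
            ORDER BY w.form""",
        (citation,),
    )
    return [form for (form,) in rows]


DB = make_db(SAMPLE)
print(lookup(DB, "-ness"))
```

The index on `word_morpheme (morpheme_id)` is what keeps a morpheme query from scanning every word; the `position` column is only there so the order of morphemes within a word is not lost.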

If you are looking for ways to do the prerequisite analysis of all the vocabulary items, that is quite a different problem from the one you seem to be describing, and it involves some highly speculative (and measurably inaccurate) machine learning techniques for automatic morphological analysis. Manual annotation by linguistically trained speakers of the language is still a necessity.


In reply to Re^2: parsing a very large array with regexps by graff
in thread parsing a very large array with regexps by pc2
