in reply to parsing a very large array with regexps

salutations,

thank you for the responses.

here is an example of pattern:

suppose we have a user input, "shivâshvah", which means "Shiva's horse" in Sanskrit. the program divides the word with commas (marking possible morpheme boundaries) into several possibilities: "shivâshvah,", "shivâshv,ah", "shivâ,shvah"... and, for each division possibility, it checks whether each morpheme of the divided word exists in the array, and it does so by means of patterns; for example, the division "shivâ,shvah" will be analyzed as "shiv(â|a),(a*)shvah", and then both "shiv(â|a)" and "(a*)shvah" will be checked against the array. is that enough to give an idea of the number of patterns? as it deals with a different language, it is difficult to explain.
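the division step could be sketched like this (this is not the original program, just an illustration of the enumeration; for a word of n indivisible units there are 2**(n-1) ways to place commas between them, i.e. 128 for the 8 units of shivâshvah — the larger counts discussed below presumably also include pattern variants such as (â|a)):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Enumerate every way to split a word into morphemes by inserting commas
# between its units. Each element of @units is treated as indivisible
# ("sh" counts as a single character).
sub divisions {
    my @units = @_;
    return ($units[0]) if @units == 1;
    my $first = shift @units;
    my @rest  = divisions(@units);
    # either glue the first unit onto the first morpheme of each
    # sub-division, or separate it with a comma
    return ( map("$first$_", @rest), map("$first,$_", @rest) );
}

my @all = divisions(qw(sh i v â sh v a h));
print scalar(@all), " division possibilities\n";   # 2**7 = 128 for 8 units
```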

How many patterns? basically, there will be as many patterns as there are morphemes in each division possibility of the user's input.

are they the same every time? no; there are as many different patterns as there are morpheme possibilities to analyze.

How often do you do this? pattern matching is done for every morpheme of every division possibility (shivâshva,h; shiv,âshv,ah; ...) of the user's input (which would be shivâshvah in this case).

How slow is "too slow"?

well, when we ran a very basic test with an array of 274947 items and 16 (test) patterns (acting like different user inputs), it took about 7 seconds. but imagine it with, say, 4000 patterns... a word like shivâshvah (treating sh as a single character) would have 255 morpheme division possibilities, from "shivâshvah," to "sh,i,v,â,sh,v,a,h", and for "sh,i,v,â,sh,v,a,h" -> "sh,i,v,(â|a),(â*|a*)sh,v,a,h", 8 words would have to be checked in the array for this division possibility ALONE.
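one way to avoid scanning all 274947 items per pattern: many of the patterns are small finite alternations like "shiv(â|a)", so they can be expanded into their literal variants and each variant tested against a hash built once from the word list — one O(1) lookup per variant instead of a full regex pass over the array. a sketch (not the original code; the word list here is a tiny stand-in, and star patterns like "(a*)shvah" would still need a regex fallback or bounded expansion):

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";

# Tiny stand-in for the 274947-item array
my @wordlist = qw(shiva shivâ ashvah shvah);
my %is_word  = map { $_ => 1 } @wordlist;   # built once, up front

# Expand one (x|y|z) group at a time, recursing until none remain.
# Only handles finite alternations, not star quantifiers.
sub expand {
    my ($pat) = @_;
    if ( $pat =~ /^(.*?)\(([^()]*)\)(.*)$/ ) {
        my ($pre, $alts, $post) = ($1, $2, $3);
        return map { expand("$pre$_$post") } split /\|/, $alts;
    }
    return ($pat);
}

# Which expanded variants of the pattern actually occur in the list?
sub in_list {
    my ($pat) = @_;
    return grep { $is_word{$_} } expand($pat);
}

print "found: $_\n" for in_list("shiv(â|a)");
```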

salutations,

Re^2: parsing a very large array with regexps
by graff (Chancellor) on Aug 20, 2007 at 01:46 UTC
    Sorry, but I'm still confused by your description (and I work in this field, so I should be able to understand). The 274947 items are "words" of a language (e.g. drawn from some collection of text data), the user input is one or more "patterns" (e.g. a string or regex representing a stem or affix morpheme), and the task is to find and list the words that match the patterns. Do I have that right?

    But then you talk about a word like "shivâshvah" having 255 possible divisions into morphemes. Obviously, most of these possibilities would be ridiculous as potential linguistic analyses. I don't understand how this sort of arithmetic is relevant to the task.

    If the ultimate goal is a query engine that allows a user to specify some sort of citation form for a stem or affix morpheme, and returns the list of vocabulary items that contain that morpheme, then you need a database in which the vocabulary items are indexed according to the morphemes that make up each word.

    That is, the morphological analysis of each word in the vocabulary needs to be done "offline" -- once, as a separate project, in advance of actually starting up the query engine (and probably with some amount of manual effort to handle the many irregular forms) -- and the results of that analysis must be stored in a database in such a way that a query on a given morpheme will quickly produce the list of word forms known to contain that morpheme.

    This would be a job for a relational database, to handle the many-to-many relations among morphemes and word forms.
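    the many-to-many index described here could be prototyped in memory before committing to an RDBMS. a sketch with invented segmentations (the analyses below are hypothetical, for illustration only):

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";

# Offline analysis result: each word form mapped to its morphemes.
# These segmentations are invented for the sake of the example.
my %analysis = (
    "shivâshvah" => [ "shiva", "ashva" ],
    "ashvamedha" => [ "ashva", "medha" ],
);

# Invert the relation: morpheme => list of word forms containing it.
my %words_for;
while ( my ($word, $morphemes) = each %analysis ) {
    push @{ $words_for{$_} }, $word for @$morphemes;
}

# A query on a morpheme is now a single hash lookup, not a scan.
my @hits = sort @{ $words_for{"ashva"} || [] };
print "ashva: @hits\n";
```

    in a relational database the same shape would be a `words` table, a `morphemes` table, and a join table between them; the hash above is just the in-memory equivalent of querying that join table.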

    If you are looking for ways to do the prerequisite analysis of all the vocabulary items, that problem is quite different from what you seem to be describing, and it involves some highly speculative (and measurably inaccurate) machine-learning techniques. Manual annotation by linguistically trained speakers of the language is still a necessity.

Re^2: parsing a very large array with regexps
by fmerges (Chaplain) on Aug 19, 2007 at 20:45 UTC

    Hi,

    I would really think about whether it wouldn't be better and faster for you guys to make the project "open". Basically, what I see is that you're still working on the same problem as in the previous posts, and that there's still a lack of information in what you're giving us; you would get a better solution if people could really help you on the project.

    Looking at the previous posts and at this one, I can see that you are really doing it as CGI scripts with code and presentation mixed together (spaghetti code)... I think it's better to research topics like separation of concerns, data mining, etc., and to have a model or engine to which you feed the input and from which you get an output, independent of the Web.

    Regards,

    fmerges at irc.freenode.net
Re^2: parsing a very large array with regexps
by b4swine (Pilgrim) on Aug 20, 2007 at 06:50 UTC
    From your description, let me make a simpler example and explain a possibility. Suppose I have a 4-morpheme word, ABCD, where I am simply denoting each morpheme by a letter. Then I can search for everything you asked by creating a monster alternation of the form
    /^(A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/
    where each of the complete possibilities is written in one alternation. A more efficient way is to factor out what is common, so that you could write:
    /^(A(B(CD?)?)?|B(CD?)?|CD?|D)$/
    Note that there are only four possible starting points, and if they don't occur, you don't have to look any further.
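    the equivalence of the flat and factored forms can be checked mechanically — a small sketch that runs both regexes over matching and non-matching strings and confirms they agree:

```perl
use strict;
use warnings;

# Flat alternation vs. its factored form (A, B, C, D stand for morphemes).
my $flat     = qr/^(A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/;
my $factored = qr/^(A(B(CD?)?)?|B(CD?)?|CD?|D)$/;

# Strings that should match, and some that should not.
for my $s (qw(A B C D AB BC CD ABC BCD ABCD AC BD DA ABCDE)) {
    my $f1 = ( $s =~ $flat )     ? 1 : 0;
    my $f2 = ( $s =~ $factored ) ? 1 : 0;
    print "$s: flat=$f1 factored=$f2\n";
    die "mismatch on $s" unless $f1 == $f2;
}
```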

    Similarly, in the example you gave, with 255 or more possible COMPLETE patterns to match, every one of them must start in only one of 9 possible ways, right? So you need to help the regexp engine by making the regexp more efficient, and that means factoring all the similar starting points into a single expression.

    I am quite sure that this approach should work for your case.

Re^2: parsing a very large array with regexps
by moritz (Cardinal) on Aug 20, 2007 at 06:15 UTC
    Maybe you could build a (sorted) tree and do the search "manually", i.e. without regexes?

    Even if you only build one file for each possible first character, you gain a factor of 26 (if you restrict it to the Latin alphabet).
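    the bucketing idea could be sketched like this (in-memory hashes standing in for the per-character files; the word list is invented for illustration):

```perl
use strict;
use warnings;

# Bucket the word list by first character, so a lookup only scans the
# words that share the pattern's first letter.
my @wordlist = qw(apple ant bear bee cat);

my %bucket;
push @{ $bucket{ substr $_, 0, 1 } }, $_ for @wordlist;

# Look up a literal prefix: consult one bucket instead of the whole list.
sub with_prefix {
    my ($prefix) = @_;
    my $b = $bucket{ substr $prefix, 0, 1 } or return;
    return grep { index($_, $prefix) == 0 } @$b;
}

print join(" ", with_prefix("be")), "\n";
```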