pc2 has asked for the wisdom of the Perl Monks concerning the following question:

salutations, we have an array with 274947 items (generated from a 2.235 MB text file with 274947 lines), and we need to find the items of this array that match some pattern (or several patterns). basically, each item of the array is a word in plain text form (so the array is a word list). we can't use hashes because, as far as we know, hash keys can't be searched by regexp pattern. we also tried connecting to an Access database, but it is too slow. here's the best we've got so far:
#!c:/perl/bin/perl
print "Content-type: text/html\n\n";
open BAP, "db.txt" or die "can't open db.txt: $!";
@biff = ();
while (<BAP>) {
    if (/foo/) { push(@biff, $_) }
    if (/bar/) { push(@biff, $_) }
    if (/baz/) { push(@biff, $_) }
    ...
}
print "@biff";
close BAP;
as you can see, it loops through the db.txt file, adding a line to @biff whenever it matches a certain regexp. matching all lines against one single pattern is fairly fast, but matching against many patterns is very slow, because we have to grep several times. we've also tried @biff = grep(/regexp/, @db) (repeating this line for every pattern, or joining all the patterns into one regexp with the | operator), but it's just as slow. any suggestions for a faster way of grepping? or maybe even an approach entirely different from storing the list in an array (but one that still lets us search for patterns quickly)? salutations,
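
for reference, the two approaches described above boil down to something like the following minimal sketch (the file name and the contents of @patterns are made-up stand-ins, and the quotemeta assumes the patterns are literal strings rather than regexps):

    use strict;
    use warnings;

    # hypothetical stand-ins for the real word list and pattern list
    open my $fh, '<', 'db.txt' or die "can't open db.txt: $!";
    my @db = <$fh>;
    close $fh;
    my @patterns = qw(foo bar baz);

    # variant 1: one pass over @db per pattern (what the loop above does)
    my @slow;
    for my $pat (@patterns) {
        push @slow, grep { /$pat/ } @db;
    }

    # variant 2: join the patterns into one precompiled regexp, scan once
    my $alt  = join '|', map quotemeta, @patterns;
    my $re   = qr/$alt/;
    my @fast = grep { /$re/ } @db;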

Replies are listed 'Best First'.
Re: parsing a very large array with regexps
by moritz (Cardinal) on Aug 19, 2007 at 17:32 UTC
Re: parsing a very large array with regexps
by BrowserUk (Patriarch) on Aug 19, 2007 at 18:44 UTC
Re: parsing a very large array with regexps
by quester (Vicar) on Aug 19, 2007 at 17:34 UTC
    Have you tried the regexp alternation operator "|" to do the search with a single regexp,
    @biff = grep(/foo|bar|baz/, @db)
    This may help somewhat by reducing the number of times the regexp engine needs to be started up.
      salutations, actually, we have already tried to search with a single regexp containing several patterns separated by the | operator, but it still takes too long.

        I was going to suggest Regexp::Assemble as a way of producing an optimised regexp, but if even a single regexp takes too much time then that approach is not worth pursuing.
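
        (For reference, a minimal sketch of what the Regexp::Assemble approach would have looked like; the pattern list here is made up:)

            use strict;
            use warnings;
            use Regexp::Assemble;

            # Regexp::Assemble factors common prefixes out of the alternation
            my $ra = Regexp::Assemble->new;
            $ra->add($_) for qw(shiva shivah ashva ashvah);
            my $re = $ra->re;    # roughly qr/(?:ashvah?|shivah?)/
            print "$re\n";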

        If only a small population of regexp sets is ever used, you should investigate caching (precomputing) the result sets that the different sets of regexps produce when applied to the source list. Then, when you know that regexps A, B, D and G are called for, go and fetch the results that correspond to those regexps.

        If the regexps are truly arbitrary, then you have little choice but to pay the full cost each time (possibly caching the result in case it is called for again later on).
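
        (A minimal sketch of that caching idea, assuming the word list is in @db and that "\0" never occurs inside a pattern:)

            # result sets cached by the sorted pattern list
            my %cache;

            sub matches_for {
                my ($db, @patterns) = @_;
                my $key = join "\0", sort @patterns;
                $cache{$key} ||= do {
                    my $alt = join '|', @patterns;
                    [ grep { /$alt/ } @$db ];
                };
                return @{ $cache{$key} };
            }

            # my @hits = matches_for(\@db, 'regexpA', 'regexpB');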

        • another intruder with the mooring in the heart of the Perl

        Oh. I think I see the problem. How many patterns are you searching for, what do the patterns actually look like, and how long is "too long"? With a 2 GHz Pentium IV, I can search a 2.2 MB file with 274K lines for three five-character patterns in about 1.35 seconds, and for thirteen in 2.00 seconds (each additional pattern adds around 0.065 seconds or so).

        Those numbers are for simple patterns like /abcde|bdfhj|acegi/. If your patterns are more like

        /a[bc]+d|[^efg](h|i)|klm*n+/
        each pattern might take on the order of ten times as long.

        If you have any pathological patterns with nested * or + quantifiers, things can get very slow indeed.

Re: parsing a very large array with regexps
by fmerges (Chaplain) on Aug 19, 2007 at 18:05 UTC

    Hi,

    It depends on your needs, but basically, if you need to do this kind of thing, I would really think about indexing the data.

    You haven't said anything about the nature of the data or where it comes from; I mean, whether it can be indexed in an RDBMS (or is already in one), or simply indexed on the filesystem, in a file from which you later extract the data.

    Take a look at Swish-e; I know there are also other alternatives... but it could be enough for your needs.

    Regards,

    fmerges at irc.freenode.net
Re: parsing a very large array with regexps
by FunkyMonk (Bishop) on Aug 19, 2007 at 17:35 UTC
    Perhaps I've misunderstood, but why can't you just use @biff = grep { /foo|bar|baz/ } <BAP>;?

      salutations, yes, we can use this approach, but it is just a shortening of the original approach, and thus takes just as long.
Re: parsing a very large array with regexps
by pc2 (Beadle) on Aug 19, 2007 at 19:55 UTC
    salutations,

    thank you for the responses.

    here is an example of pattern:

    suppose we have a user input, "shivâshvah", which means "Shiva's horse" in Sanskrit. the program divides the word with commas (marking possible morpheme boundaries) into several possibilities: "shivâshvah,"; "shivâshv,ah"; "shivâ,shvah"... and, for each division possibility, it checks for the existence in the array of each morpheme of the divided word, and it does so using patterns; for example, the division "shivâ,shvah" will be analyzed as "shiv(â|a),(a*)shvah", and then "shiv(â|a)" and "(a*)shvah" will both be checked against the array. is that enough to give an idea of the number of patterns? since it deals with a foreign language, it's difficult to explain.
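
    (to illustrate the division step, a small sketch that enumerates every comma-division of a word; it treats each character as one unit, so "sh" would need pre-tokenizing, and it ignores the euphony rewriting:)

        use strict;
        use warnings;

        # all 2**(n-1) ways to cut an n-character word into morpheme candidates
        sub divisions {
            my ($word) = @_;
            return ($word) if length($word) <= 1;
            my $first = substr $word, 0, 1;
            my @out;
            for my $sub ( divisions(substr $word, 1) ) {
                push @out, "$first,$sub";   # cut after the first character
                push @out, "$first$sub";    # no cut after the first character
            }
            return @out;
        }

        print "$_\n" for divisions('shiva');   # 16 possibilities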

    How many patterns? basically, there will be as many patterns as there are morphemes in each division possibility of the user's input.

    are they the same every time? no, there are as many different patterns as there are morpheme possibilities to be analyzed.

    How often do you do this? pattern matching is done for every morpheme of every division possibility (shivâshva,h; shiv,âshv,ah; ...) of the user's input (which would be shivâshvah in this case).

    How slow is "too slow"?

    well, when we ran a very basic test with an array of 274947 items and 16 (test) patterns (acting like different user inputs), it took about 7 seconds. but imagine it with, say, 4000 patterns... a word like shivâshvah (treating sh as a single character) would have 255 morpheme division possibilities, from "shivâshvah," to "sh,i,v,â,sh,v,a,h", and, for "sh,i,v,â,sh,v,a,h" -> "sh,i,v,(â|a),(â*|a*)sh,v,a,h", 8 words would have to be checked against the array, ONLY for this division possibility.

    salutations,
      Sorry, but I'm still confused by your description (and I work in this field, so I should be able to understand). The 274947 items are "words" of a language (e.g. drawn from some collection of text data), the user input is one or more "patterns" (e.g. a string or regex representing a stem or affix morpheme), and the task is to find and list the words that match the patterns. Do I have that right?

      But then you talk about a word like "shivâshvah" having 255 possible divisions into morphemes. Obviously, most of these possibilities would be ridiculous as potential linguistic analyses. I don't understand how this sort of arithmetic is relevant to the task.

      If the ultimate goal is a query engine that allows a user to specify some sort of citation form for a stem or affix morpheme, and returns the list of vocabulary items that contain that morpheme, then you need a database in which the vocabulary items are indexed according to the morphemes that make up each word.

      That is, the morphological analysis of each word in the vocabulary needs to be done "offline" -- once, as a separate project, in advance of actually starting up the query engine (and probably with some amount of manual effort to handle the many irregular forms) -- and the results of that analysis must be stored in a database in such a way that a query on a given morpheme will quickly produce the list of word forms known to contain that morpheme.

      This would be a job for a relational database, to handle the many-to-many relations among morphemes and word forms.
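
      (A minimal sketch of such a schema and query, using SQLite via DBI; the table and column names are invented:)

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:SQLite:dbname=lexicon.db', '', '',
                                 { RaiseError => 1 });

          # many-to-many: a word form contains several morphemes, and a
          # morpheme appears in many word forms
          $dbh->do($_) for (
              'CREATE TABLE IF NOT EXISTS word
                   (word_id INTEGER PRIMARY KEY, form TEXT UNIQUE)',
              'CREATE TABLE IF NOT EXISTS morpheme
                   (morph_id INTEGER PRIMARY KEY, form TEXT UNIQUE)',
              'CREATE TABLE IF NOT EXISTS word_morpheme
                   (word_id INTEGER REFERENCES word,
                    morph_id INTEGER REFERENCES morpheme)',
          );

          # all word forms known to contain a given morpheme
          my $forms = $dbh->selectcol_arrayref(
              'SELECT w.form
                 FROM word w
                 JOIN word_morpheme wm ON wm.word_id = w.word_id
                 JOIN morpheme m       ON m.morph_id = wm.morph_id
                WHERE m.form = ?', undef, 'ashvah');
          print "$_\n" for @$forms;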

      If you are looking for ways to do the prerequisite analysis of all the vocabulary items, that problem is quite different from what you seem to be describing, and it involves some highly speculative (and measurably inaccurate) techniques in the field of automatic machine learning algorithms. Manual annotation by linguistically trained speakers of the language is still a necessity.

      Hi,

      I would really think about whether it wouldn't be better and faster, if possible, for you guys to make the project "open". Basically, what I see is that you're still working on the same problem as in your previous posts... and that there's still a lack of information in what you're telling us; you would get a better solution if people could really help you with the project.

      Looking at your previous posts and at this one, I can see that you are really doing this as CGI scripts with code and presentation in one (spaghetti code)... I think it's better to research topics like separation of concerns, data mining, etc., and to have a model or engine, independent of the Web, to which you feed the input and from which you get the output.

      Regards,

      fmerges at irc.freenode.net
      From your description, let me make a simpler example and explain a possibility. Suppose I have a word of 4 "morphemes", ABCD, where I am simply denoting each morpheme by a letter. Then I can search for everything you asked for by creating a monster alternation of the form
      /^(A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/
      where each of the complete possibilities is written in one alternation. A more efficient way is to factor out what is common, so that you could write:
      /^(A(B(CD?)?)?|B(CD?)?|CD?|D)$/
      Note that there are only four possible starting points, and if they don't occur, you don't have to look any further.

      Similarly, in the example you gave, with 255 or more possible COMPLETE patterns to match, every one of them must start in only one of 9 possible ways, right? So you need to help the regexp engine by making the regexp more efficient, and that means factoring all the similar starting points into a single expression.

      I am quite sure that this approach should work for your case.
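
      (A tiny self-contained check that the flat and the factored regex above accept exactly the same strings:)

          use strict;
          use warnings;

          my $flat     = qr/^(?:A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/;
          my $factored = qr/^(?:A(?:B(?:CD?)?)?|B(?:CD?)?|CD?|D)$/;

          # contiguous substrings of ABCD match; everything else fails
          for my $w (qw(A ABC BCD CD D AD ABD X)) {
              printf "%-4s flat=%d factored=%d\n",
                  $w, ($w =~ $flat ? 1 : 0), ($w =~ $factored ? 1 : 0);
          }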

      Maybe you could build a (sorted) tree and do the search "manually", i.e. without regexes?

      Even if you only build one file for each possible first character, you gain a factor of 26 (if you restrict it to the Latin alphabet).
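
      (A rough sketch of that bucketing; the bucket_*.txt names are invented, and anything outside a-z goes into a catch-all bucket:)

          use strict;
          use warnings;

          # split db.txt into one file per initial letter, so an anchored
          # search only has to scan the bucket(s) it could start in
          open my $in, '<', 'db.txt' or die "can't open db.txt: $!";
          my %out;
          while (my $word = <$in>) {
              my $first = lc substr $word, 0, 1;
              $first = '_' if $first !~ /[a-z]/;
              unless ($out{$first}) {
                  open $out{$first}, '>>', "bucket_$first.txt"
                      or die "can't open bucket_$first.txt: $!";
              }
              print { $out{$first} } $word;
          }
          close $in;
          close $_ for values %out;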

Re: parsing a very large array with regexps
by swampyankee (Parson) on Aug 19, 2007 at 23:59 UTC

    I don't consider an array of about 275,000 items to be that big; "big" is on the order of millions, not a couple of hundred thousand.

    You need to provide some more information, such as: are the words expected to be unique, are your searches case-insensitive, are regexen required, how often does this need to be done, etc. I seem to remember Jon Bentley did something akin to this in his book Programming Pearls. I believe he was building a word list for a spell-checker. I can't remember the exact form of the data structure he used, so I won't go into further detail.


    emc


Re: parsing a very large array with regexps
by pc2 (Beadle) on Aug 24, 2007 at 21:09 UTC

    salutations,

    yes, the items of the array are words of a language, but they are all the inflected forms of the words of the language. but the user input is not one or more patterns; the user input is intended to be words which stand together in a phrase (or in a compound word), glued together by euphony (sandhi) rules, as in "shivâshvah" <- shiva (Shiva, compound) + ashvah (horse, nominative), in which "a+a" becomes a long "â".

    when we talk about 255 possibilities, they are only "possibilities" of division, which may be eliminated if one of the "possible" morphemes isn't found.