pc2 has asked for the wisdom of the Perl Monks concerning the following question:

salutations, we have an array with 274947 items (generated from a 2.235 MB text file with 274947 lines), and we need to find the items of this array that match some pattern (or several patterns). basically, each item of the array is a word in plain text form (so the array is a word list). we can't use hashes because, as far as we know, hash keys can't be searched by regexp pattern. we also tried connecting to an Access database, but it is too slow. here's the best we've got so far:
#!c:/perl/bin/perl
print "Content-type: text/html\n\n";
open BAP, "db.txt" or die "can't open db.txt: $!";
@biff = ();
while (<BAP>) {
    if (/foo/) { push(@biff, $_) }
    if (/bar/) { push(@biff, $_) }
    if (/baz/) { push(@biff, $_) }
    ...
}
print "@biff";
close BAP;
as you can see, it loops through the db.txt file, adding a line to @biff whenever it matches a certain regexp. matching all lines against one single pattern is fairly fast, but matching against many patterns is very slow, because we have to grep several times. we've also tried @biff = grep(/regexp/, @db) (repeating this line for every pattern, or joining all the patterns into one regexp with the | operator), but it's just as slow. any suggestions for a faster way of grepping? or maybe even an approach entirely different from storing the list in an array (but one that still lets us search for patterns quickly)? salutations,
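
for reference, the two approaches described above boil down to something like the following minimal sketch (the file name and the contents of @patterns are made-up stand-ins, and the quotemeta assumes the patterns are literal strings rather than regexps):

    use strict;
    use warnings;

    # hypothetical stand-ins for the real word list and pattern list
    open my $fh, '<', 'db.txt' or die "can't open db.txt: $!";
    my @db = <$fh>;
    close $fh;
    my @patterns = qw(foo bar baz);

    # variant 1: one pass over @db per pattern (what the loop above does)
    my @slow;
    for my $pat (@patterns) {
        push @slow, grep { /$pat/ } @db;
    }

    # variant 2: join the patterns into one precompiled regexp, scan once
    my $alt  = join '|', map quotemeta, @patterns;
    my $re   = qr/$alt/;
    my @fast = grep { /$re/ } @db;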

Replies are listed 'Best First'.
Re: parsing a very large array with regexps
by moritz (Cardinal) on Aug 19, 2007 at 17:32 UTC
Re: parsing a very large array with regexps
by BrowserUk (Patriarch) on Aug 19, 2007 at 18:44 UTC
Re: parsing a very large array with regexps
by quester (Vicar) on Aug 19, 2007 at 17:34 UTC
    Have you tried the regexp alternation operator "|" to do the search with a single regexp,
    @biff = grep(/foo|bar|baz/, @db)
    This may help somewhat by reducing the number of times the regexp engine needs to be started up.
      salutations, actually, we have already tried to search with a single regexp containing several patterns separated by the | operator, but it still takes too long.

        I was going to suggest Regexp::Assemble as a way of producing an optimised regexp, but if even a single regexp takes too much time then that approach is not worth pursuing.
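
        (For reference, a minimal sketch of what the Regexp::Assemble approach would have looked like; the pattern list here is made up:)

            use strict;
            use warnings;
            use Regexp::Assemble;

            # Regexp::Assemble factors common prefixes out of the alternation
            my $ra = Regexp::Assemble->new;
            $ra->add($_) for qw(shiva shivah ashva ashvah);
            my $re = $ra->re;    # roughly qr/(?:ashvah?|shivah?)/
            print "$re\n";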

        If only a small population of regexp sets is ever used, you should investigate caching (precomputing) the result sets that the different sets of regexps produce when applied to the source list. Then, when you know that regexps A, B, D and G are called for, go and fetch the results that correspond to those regexps.

        If the regexps are truly arbitrary, then you have little choice but to pay the full cost each time (possibly caching the result in case it is called for again later on).
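
        (A minimal sketch of that caching idea, assuming the word list is in @db and that "\0" never occurs inside a pattern:)

            # result sets cached by the sorted pattern list
            my %cache;

            sub matches_for {
                my ($db, @patterns) = @_;
                my $key = join "\0", sort @patterns;
                $cache{$key} ||= do {
                    my $alt = join '|', @patterns;
                    [ grep { /$alt/ } @$db ];
                };
                return @{ $cache{$key} };
            }

            # my @hits = matches_for(\@db, 'regexpA', 'regexpB');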

        • another intruder with the mooring in the heart of the Perl

        Oh. I think I see the problem. How many patterns are you searching for, what do the patterns actually look like, and how long is "too long"? With a 2 GHz Pentium IV, I can search a 2.2 MB file with 274K lines for three five-character patterns in about 1.35 seconds, and for thirteen in 2.00 seconds (each additional pattern adds around 0.065 seconds or so).

        Those numbers are for simple patterns like /abcde|bdfhj|acegi/. If your patterns are more like

        /a[bc]+d|[^efg](h|i)|klm*n+/
        each pattern might take on the order of ten times as long.

        If you have any pathological patterns with nested * or + quantifiers, things can get very slow indeed.

Re: parsing a very large array with regexps
by fmerges (Chaplain) on Aug 19, 2007 at 18:05 UTC

    Hi,

    It depends on your needs, but basically, if you need to do this kind of thing, I would really think about indexing the data.

    You haven't said anything about the nature of the data or where it comes from; I mean, whether it can be indexed in an RDBMS (or is already in one), or simply indexed on the filesystem, in a file from which you later extract the data.

    Take a look at Swish-e; I know there are also other alternatives... but it could be enough for your needs.

    Regards,

    fmerges at irc.freenode.net
Re: parsing a very large array with regexps
by FunkyMonk (Bishop) on Aug 19, 2007 at 17:35 UTC
    Perhaps I've misunderstood, but why can't you just use @biff = grep { /foo|bar|baz/ } <BAP>;?

      salutations, yes, we can use this approach, but it is just a shortening of the original approach, and thus takes just as long.
Re: parsing a very large array with regexps
by pc2 (Beadle) on Aug 19, 2007 at 19:55 UTC
    salutations,

    thank you for the responses.

    here is an example of pattern:

    suppose we have a user input, "shivâshvah", which means "Shiva's horse" in Sanskrit. the program divides the word with commas (marking possible morpheme boundaries) into several possibilities: "shivâshvah,"; "shivâshv,ah"; "shivâ,shvah"... and, for each division possibility, it checks for the existence in the array of each morpheme of the divided word, and it does so using patterns; for example, the division "shivâ,shvah" will be analyzed as "shiv(â|a),(a*)shvah", and then "shiv(â|a)" and "(a*)shvah" will both be checked against the array. is that enough to give an idea of the number of patterns? since it deals with a foreign language, it's difficult to explain.
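
    (to illustrate the division step, a small sketch that enumerates every comma-division of a word; it treats each character as one unit, so "sh" would need pre-tokenizing, and it ignores the euphony rewriting:)

        use strict;
        use warnings;

        # all 2**(n-1) ways to cut an n-character word into morpheme candidates
        sub divisions {
            my ($word) = @_;
            return ($word) if length($word) <= 1;
            my $first = substr $word, 0, 1;
            my @out;
            for my $sub ( divisions(substr $word, 1) ) {
                push @out, "$first,$sub";   # cut after the first character
                push @out, "$first$sub";    # no cut after the first character
            }
            return @out;
        }

        print "$_\n" for divisions('shiva');   # 16 possibilities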

    How many patterns? basically, there will be as many patterns as there are morphemes in each division possibility of the user's input.

    are they the same every time? no, there are as many different patterns as there are morpheme possibilities to be analyzed.

    How often do you do this? pattern matching is done for every morpheme of every division possibility (shivâshva,h; shiv,âshv,ah; ...) of the user's input (which would be shivâshvah in this case).

    How slow is "too slow"?

    well, when we ran a very basic test with an array of 274947 items and 16 (test) patterns (acting like different user inputs), it took about 7 seconds. but imagine it with, say, 4000 patterns... a word like shivâshvah (treating sh as a single character) would have 255 morpheme division possibilities, from "shivâshvah," to "sh,i,v,â,sh,v,a,h", and, for "sh,i,v,â,sh,v,a,h" -> "sh,i,v,(â|a),(â*|a*)sh,v,a,h", 8 words would have to be checked against the array, ONLY for this division possibility.

    salutations,
      Sorry, but I'm still confused by your description (and I work in this field, so I should be able to understand). The 274947 items are "words" of a language (e.g. drawn from some collection of text data), the user input is one or more "patterns" (e.g. a string or regex representing a stem or affix morpheme), and the task is to find and list the words that match the patterns. Do I have that right?

      But then you talk about a word like "shivâshvah" having 255 possible divisions into morphemes. Obviously, most of these possibilities would be ridiculous as potential linguistic analyses. I don't understand how this sort of arithmetic is relevant to the task.

      If the ultimate goal is a query engine that allows a user to specify some sort of citation form for a stem or affix morpheme, and returns the list of vocabulary items that contain that morpheme, then you need a database in which the vocabulary items are indexed according to the morphemes that make up each word.

      That is, the morphological analysis of each word in the vocabulary needs to be done "offline" -- once, as a separate project, in advance of actually starting up the query engine (and probably with some amount of manual effort to handle the many irregular forms) -- and the results of that analysis must be stored in a database in such a way that a query on a given morpheme will quickly produce the list of word forms known to contain that morpheme.

      This would be a job for a relational database, to handle the many-to-many relations among morphemes and word forms.
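
      (A minimal sketch of such a schema and query, using SQLite via DBI; the table and column names are invented:)

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:SQLite:dbname=lexicon.db', '', '',
                                 { RaiseError => 1 });

          # many-to-many: a word form contains several morphemes, and a
          # morpheme appears in many word forms
          $dbh->do($_) for (
              'CREATE TABLE IF NOT EXISTS word
                   (word_id INTEGER PRIMARY KEY, form TEXT UNIQUE)',
              'CREATE TABLE IF NOT EXISTS morpheme
                   (morph_id INTEGER PRIMARY KEY, form TEXT UNIQUE)',
              'CREATE TABLE IF NOT EXISTS word_morpheme
                   (word_id INTEGER REFERENCES word,
                    morph_id INTEGER REFERENCES morpheme)',
          );

          # all word forms known to contain a given morpheme
          my $forms = $dbh->selectcol_arrayref(
              'SELECT w.form
                 FROM word w
                 JOIN word_morpheme wm ON wm.word_id = w.word_id
                 JOIN morpheme m       ON m.morph_id = wm.morph_id
                WHERE m.form = ?', undef, 'ashvah');
          print "$_\n" for @$forms;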

      If you are looking for ways to do the prerequisite analysis of all the vocabulary items, that problem is quite different from what you seem to be describing, and it involves some highly speculative (and measurably inaccurate) techniques in the field of automatic machine learning algorithms. Manual annotation by linguistically trained speakers of the language is still a necessity.

      Hi,

      I would really think about whether it wouldn't be better and faster, if possible, for you guys to make the project "open". Basically, what I see is that you're still working on the same problem as in your previous posts... and that there's still a lack of information in what you're telling us; you would get a better solution if people could really help you with the project.

      Looking at your previous posts and at this one, I can see that you are really doing this as CGI scripts with code and presentation in one (spaghetti code)... I think it's better to research topics like separation of concerns, data mining, etc., and to have a model or engine, independent of the Web, to which you feed the input and from which you get the output.

      Regards,

      fmerges at irc.freenode.net
      From your description, let me make a simpler example and explain a possibility. Suppose I have a word of 4 "morphemes", ABCD, where I am simply denoting each morpheme by a letter. Then I can search for everything you asked for by creating a monster alternation of the form
      /^(A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/
      where each of the complete possibilities is written in one alternation. A more efficient way is to factor out what is common, so that you could write:
      /^(A(B(CD?)?)?|B(CD?)?|CD?|D)$/
      Note that there are only four possible starting points, and if they don't occur, you don't have to look any further.

      Similarly, in the example you gave, with 255 or more possible COMPLETE patterns to match, every one of them must start in only one of 9 possible ways, right? So you need to help the regexp engine by making the regexp more efficient, and that means factoring all the similar starting points into a single expression.

      I am quite sure that this approach should work for your case.
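
      (A tiny self-contained check that the flat and the factored regex above accept exactly the same strings:)

          use strict;
          use warnings;

          my $flat     = qr/^(?:A|B|C|D|AB|BC|CD|ABC|BCD|ABCD)$/;
          my $factored = qr/^(?:A(?:B(?:CD?)?)?|B(?:CD?)?|CD?|D)$/;

          # contiguous substrings of ABCD match; everything else fails
          for my $w (qw(A ABC BCD CD D AD ABD X)) {
              printf "%-4s flat=%d factored=%d\n",
                  $w, ($w =~ $flat ? 1 : 0), ($w =~ $factored ? 1 : 0);
          }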

      Maybe you could build a (sorted) tree and do the search "manually", i.e. without regexes?

      Even if you only build one file for each possible first character, you gain a factor of 26 (if you restrict it to the Latin alphabet).
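
      (A rough sketch of that bucketing; the bucket_*.txt names are invented, and anything outside a-z goes into a catch-all bucket:)

          use strict;
          use warnings;

          # split db.txt into one file per initial letter, so an anchored
          # search only has to scan the bucket(s) it could start in
          open my $in, '<', 'db.txt' or die "can't open db.txt: $!";
          my %out;
          while (my $word = <$in>) {
              my $first = lc substr $word, 0, 1;
              $first = '_' if $first !~ /[a-z]/;
              unless ($out{$first}) {
                  open $out{$first}, '>>', "bucket_$first.txt"
                      or die "can't open bucket_$first.txt: $!";
              }
              print { $out{$first} } $word;
          }
          close $in;
          close $_ for values %out;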

Re: parsing a very large array with regexps
by swampyankee (Parson) on Aug 19, 2007 at 23:59 UTC

    I don't consider an array of about 275,000 items to be that big; "big" is on the order of millions, not a couple of hundred thousand.

    You need to provide some more information, such as: are the words expected to be unique, are your searches case-insensitive, are regexen required, how often does this need to be done, etc. I seem to remember Jon Bentley did something akin to this in his book Programming Pearls. I believe he was building a word list for a spell-checker. I can't remember the exact form of the data structure he used, so I won't go into further detail.


    emc


Re: parsing a very large array with regexps
by pc2 (Beadle) on Aug 24, 2007 at 21:09 UTC

    salutations,

    yes, the items of the array are words of a language, but they are all the inflected forms of the words of the language. but the user input is not one or more patterns; the user input is intended to be words which stand together in a phrase (or in a compound word), glued together by euphony (sandhi) rules, as in "shivâshvah" <- shiva (Shiva, compound) + ashvah (horse, nominative), in which "a+a" becomes a long "â".

    when we talk about 255 possibilities, they are only "possibilities" of division, which may be eliminated if one of the "possible" morphemes isn't found.