Re: Alternative to querying a database to save time
by mbethke (Hermit) on Dec 29, 2011 at 05:28 UTC
You could join all the search terms into one big regexp:
my $re = join '|', map { chomp; $_ } <FILE2>;
then fetch all the abstracts for each of your keys and run the regexp against each of them. Take care to precompile it with $re = qr/$re/. If you'd like the terms to only be found as whole words, something like $re = qr/\b(?:$re)\b/ should do. With that many terms, it's probably well worth using Regexp::Trie:
use Regexp::Trie;
my $rt = Regexp::Trie->new;
while (<FILE2>) {
    chomp;           # strip the newline so it doesn't end up in the pattern
    $rt->add($_);
}
my $re = $rt->regexp;
You can then still add the word-boundary wrapper as above.
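A minimal self-contained sketch of the plain join-based variant with the word-boundary wrapper (the term list and abstracts here are made-up stand-ins for FILE2 and the database results, and quotemeta is added so metacharacters in terms can't break the pattern):

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the FILE2 terms and the fetched abstracts
my @terms     = ('cancer therapy', 'genome');
my @abstracts = (
    'A novel cancer therapy approach',
    'An unrelated metallurgy paper',
);

# Build one alternation, quoting metacharacters, whole words only
my $re = join '|', map { quotemeta } @terms;
$re = qr/\b(?:$re)\b/;

for my $abstract (@abstracts) {
    print "HIT: $abstract\n" if $abstract =~ $re;
}
```

With Regexp::Trie you would build $re from $rt->regexp instead of the join, then apply the same qr/\b(?:...)\b/ wrapper.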
Re: Alternative to querying a database to save time
by BrowserUk (Patriarch) on Dec 29, 2011 at 09:08 UTC
I have around 58,00,000 primarykeys in FILE1
Is that 5.8 million or 58 million or ... ?
And how many rows are there in your table?
I am going to work on multiple such FILES.
Do you mean pairs of files?
Ie. will each new query consist of a different FILE1 & FILE2? Or the same FILE1 with a different FILE2? Or vice versa?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
Never heard about lakh? Or crore? Beautiful brain gymnastics.
Re: Alternative to querying a database to save time
by TJPride (Pilgrim) on Dec 29, 2011 at 07:38 UTC
You -could- do this with Perl. The problem is that you'd need to load all your data into memory to do efficient searching, and unless you want to keep it there (which will eat hundreds of MB of RAM minimum...), you'd have to recreate the hash every time you want to do a lookup, which probably won't be that fast either. I'd personally use a database if you're going to be doing more than a few searches per day, if your data updates often, or if it has to be accurate in real time. If you search very infrequently and your data is more or less static, a Perl solution will work fine.
As always, giving us samples of your data is going to help if you want specific queries / code.
> You -could- do this with Perl, problem is you'll need to load all your data into memory to do efficient searching [...]

As far as I understood, he just needs, for every <file1, file2> pair, the abstracts from a static DB that are both associated with one of the keys in file1 and match one of the terms in file2. To do that, he could fetch the abstracts for each of the keys, do the searching in Perl and, depending on the result, either output the abstract or just throw the record away. That way he only has to keep one record at a time in memory, and it would likely be faster than MySQL's horribly inefficient regexen, not to mention one separate query per key times search terms.
> As always, giving us samples of your data is going to help if you want specific queries / code.

True. The Stackoverflow link has some snippets though, if not the clearest.
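A rough sketch of that fetch-then-filter idea. The table and column names (`abstracts`, `pk`, `abstract`) are made up, not the OP's actual schema, and the DB part is wrapped in a sub (it needs a live DBI handle), while the term-compilation helper is pure Perl:

```perl
use strict;
use warnings;

# Compile the FILE2 terms into one whole-word alternation
sub compile_terms {
    my @terms = @_;
    my $re = join '|', map { quotemeta } @terms;
    return qr/\b(?:$re)\b/;
}

# Stream one abstract at a time; print only the keys whose abstract matches.
# Requires a connected DBI handle; table/column names are hypothetical.
sub filter_keys {
    my ($dbh, $keys_fh, $re) = @_;
    my $sth = $dbh->prepare('SELECT abstract FROM abstracts WHERE pk = ?');
    while (my $pk = <$keys_fh>) {
        chomp $pk;
        $sth->execute($pk);
        while (my ($abstract) = $sth->fetchrow_array) {
            print "$pk\n" if $abstract =~ $re;   # one record in memory at a time
        }
    }
}
```

Only one row is ever held in memory, and the regexp is compiled once rather than letting MySQL re-scan the pattern per row.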
Thank you mbethke. Yeah, this is exactly what I want to do. I'm following the code from the Stackoverflow answer by stevenl. It works, but takes a much longer time. One more thing worth mentioning: my abstracts are full-text indexed.
And after searching the abstracts, I just want the associated primarykey, not the abstract. I have put the terms in an array and am matching them against the abstract in MySQL. Can you please help me modify the code?
my $against = join ' ', map { qq/"$_"/ } @foo;
while (my $PK = <FILE1>) {
    chomp $PK;
    my $sth = $dbh->prepare(qq{
        SELECT value
        FROM value_table
        WHERE primary_key = ?
          # no '+' so it can match against at least one of the words in the list
          AND MATCH (column_text_indexed) AGAINST (? IN BOOLEAN MODE)
    });
    $sth->execute($PK, $against);
}
FILE1 has the list of all primarykeys and FILE2 has the list of terms.
Thank you for your help.
Re: Alternative to querying a database to save time
by Anonymous Monk on Dec 29, 2011 at 08:05 UTC
I would try to do as much work in the database as possible. In this case, I would create a (temporary) table with a single column containing the PK values, and join on that (which effectively filters your results). However, you are using mysql, and AFAIK that database can only do nested-loop joins rather than the join modes far better suited to this sort of query, so this method might actually be worse performance-wise. If the PK text file doesn't change and you need to run this query repeatedly, it may be a convenient method. Adding a flag column to the table indicating whether a row is "allowed" may be acceptable, too.
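The temporary-table idea might look something like this in DBI. Everything here is a sketch under assumed names: `wanted_keys`, `value_table`, `primary_key` and `column_text_indexed` are invented, and the sub needs a live MySQL connection to actually run:

```perl
use strict;
use warnings;

# Build the AGAINST(...) string from the FILE2 terms
sub build_against {
    return join ' ', map { qq/"$_"/ } @_;
}

# Load the FILE1 keys into a temporary table, then let the database
# filter with a single join instead of one query per key.
# Requires a connected DBI handle; all table/column names are made up.
sub filter_via_temp_table {
    my ($dbh, $keys_fh, $against) = @_;
    $dbh->do('CREATE TEMPORARY TABLE wanted_keys (pk INT NOT NULL PRIMARY KEY)');
    my $ins = $dbh->prepare('INSERT INTO wanted_keys (pk) VALUES (?)');
    while (my $pk = <$keys_fh>) {
        chomp $pk;
        $ins->execute($pk);
    }
    return $dbh->selectcol_arrayref(qq{
        SELECT t.primary_key
        FROM value_table t
        JOIN wanted_keys w ON w.pk = t.primary_key
        WHERE MATCH (t.column_text_indexed) AGAINST (? IN BOOLEAN MODE)
    }, undef, $against);
}
```

The single join query replaces millions of per-key round trips, at the cost of loading the keys into the server first.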
Your other option would be to emulate the hash join strategy with Perl's hashes and filter manually. There's nothing wrong with that, but it does require quite a bit of data transferred out of the database. Do note that if you do it a single record at a time, it takes little memory, as the largest thing you'll need is a hash made out of the PKs in FILE1. (Listen to the other posters, too; they make good points)
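A pure-Perl sketch of that hash-join emulation: the build side hashes the FILE1 keys, the probe side streams rows and keeps only wanted ones. In real use the rows would come one at a time from a DBI cursor; here they're passed in as a plain array for illustration:

```perl
use strict;
use warnings;

# Build the in-memory (hash) side of the join from the FILE1 keys
sub build_key_set {
    my ($fh) = @_;
    my %wanted;
    while (my $pk = <$fh>) {
        chomp $pk;
        $wanted{$pk} = 1;
    }
    return \%wanted;
}

# Probe side: stream [pk, abstract] rows and keep only the wanted keys
sub probe {
    my ($wanted, $rows) = @_;
    return grep { $wanted->{ $_->[0] } } @$rows;
}
```

The hash of PKs is the only thing that must stay resident; each probed record can be discarded immediately after the lookup.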