Actually, it's not very clear to me why you would want an inverted index for this. Inverted indexes are good for searching data that is NOT already organized into a structure; typically, they are used for full-text searching. To answer questions like "how many respondents chose answer C on question 6?" you would be better off with a standard database. MySQL can handle tables with many millions of rows, so your data set doesn't sound like a problem to me at all. | [reply] |
Well, I've tried the 3,000 column by 20,000 row thing.
respondent_id, response_1, response_2, .....
Searches of the form
select count(*) from theTable where (response_1=1 and response_n=1);
perform abysmally, averaging ~0.24 seconds per select execution. I'm looking for under 0.1 sec because I'm producing reports in real time, and a single report requires a few hundred selects to produce.
I can use respondent_id as the primary key, but each of the 3,000 columns is equally likely to be searched on and I can't index them all. I don't believe MySQL will allow 3,000 indexes on a single table.
So .... I got thinking about an inverted index so I could quickly retrieve the list of respondents who gave response x, the list of respondents who gave response y, and then compute their intersection.
My first version of this was to have a separate table for each possible response and populate it with respondent id's. That improved performance to ~0.16 sec average. Of course, each table has its own index. If, instead, I had one large inverted index,
response_id, respondent_1, respondent_2, ...
I would only need to use the one response_id column as primary key, reducing total size and complexity of the database. In that case I have two options:
- put respondent id's in the columns and my selects become
select * from theTable where response_id = x
- use the response id's as column names and fill with 1s and 0s, but then I need a way to associate matches with their column names.....
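Either way, the core operation is the same: pull two respondent lists and intersect them. As a minimal sketch of that step (in Python, with made-up respondent id lists standing in for the two retrieved rows):

```python
# Sketch: intersect two lists of respondent ids, as an inverted
# index would return them for two different responses.
# These lists are illustrative stand-ins for real query results.
response_x = [3, 7, 12, 19, 25, 31]   # respondents who gave response x
response_y = [5, 7, 11, 19, 31, 40]   # respondents who gave response y

# Set intersection yields the respondents who gave both answers;
# its size is the count the report needs.
both = set(response_x) & set(response_y)
count = len(both)
```

The same intersection can of course be done inside the database with a self-join, which avoids shipping the full lists to the client.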
Soooo, I was thinking of trying the search engine trick and building a reverse index like Search::InvertedIndex.
I'm very interested in ideas.
Thanks.
| [reply] |
I'm not sure I understand your database schema. From what I understood, I created the following SQL schema (in SQLite syntax). It is not denormalized, so it might be suboptimal for speed, but it is as clean as I can imagine it. By "fk" I mean "foreign key", so offered_choice.choice should be a value that can be found in the response.id column.
create table response (
-- models a possible response
id integer primary key,
text varchar
);
create table question (
-- models a possible question
id integer primary key,
text varchar
);
create table offered_choice (
-- models the offered choices to a question
id integer primary key,
question integer,
choice integer -- fk into response
);
create table questionnaire (
id integer primary key,
position integer,
choice_set integer, -- fk into offered_choice
correct integer -- fk into response
);
create table answer (
-- models a given answer
id integer primary key,
subject integer,
questionlet integer,
response integer
);
create table subject (
-- models a person taking a test
id integer primary key,
  name varchar
);
With these tables filled, and indices at least on every primary key column, your queries become joins over multiple tables, but they should also not be too slow.
Your dataset doesn't seem too large, though, so I wonder what you did to actually optimize MySQL performance. MySQL can hold a (read-only) table completely in memory, so you should tell MySQL to do just that; this alone should speed up your queries quite a lot. On the other hand, I still wonder about your current database schema, because blindly throwing around 3,000 tables or 3,000 indices doesn't make much sense to me. I'm also not sure whether every subject got asked every question and got offered every possible response; my schema allows for mixing the responses offered to each question, but also allows for only one "correct" response to every questionlet. | [reply] [d/l] |
I don't want to sound too harsh, but you really need to do some reading on relational database schema design. There is no problem I can think of that would require 3,000 columns to represent. Something more like Corion's design is what I would have done. With proper indexes, this should handle the relatively small dataset you have very efficiently.
| [reply] |
20,000 rows of 3,000 bits is only about 7 megabytes. Another idea might be a bit array in memory that you could search by and-ing with bit masks. PDL might make fast manipulations easier. | [reply] |
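A minimal sketch of that bit-array idea, using plain Python integers as arbitrary-length bit vectors (the respondent id lists here are made up; PDL or a C bit vector would be the faster equivalent of the same operations):

```python
# One integer per response, where bit i is set iff respondent i
# gave that response. Intersecting two responses is then a single
# bitwise AND, and the count is a popcount of the result.

def make_mask(respondent_ids):
    """Build a bitmask with one bit set per respondent id."""
    mask = 0
    for rid in respondent_ids:
        mask |= 1 << rid
    return mask

# Made-up respondent lists for two responses.
gave_x = make_mask([3, 7, 12, 19])
gave_y = make_mask([5, 7, 19, 40])

# AND the masks, then count set bits to get the intersection size.
both = gave_x & gave_y
count = bin(both).count("1")
```

For 20,000 respondents each mask is only 2,500 bytes, so all 3,000 response masks fit comfortably in memory, and the AND-plus-popcount per query pair is far cheaper than a disk-backed join.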
CREATE TABLE `responses` (
  `respondent` smallint(5) unsigned NOT NULL default '0',
  `response` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY (`respondent`,`response`),
  KEY `respondent` (`respondent`),
  KEY `response` (`response`)
) TYPE=MyISAM
Populated it with my test data, and tried my search
SELECT COUNT(*) FROM theTable A
JOIN theTable B ON (A.respondent = B.respondent)
WHERE A.response = ? AND B.response = ?
And it's averaging 0.2 seconds this morning while the previous two-column version is averaging 0.13 sec and the far-fetched one-table-per-response db is averaging 0.09 sec.
Anyway, EXPLAIN on this query in this new database says
+-------+--------+-----------------------------+----------+---------+--------------------+------+--------------------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+--------+-----------------------------+----------+---------+--------------------+------+--------------------------+
| A | ref | PRIMARY,respondent,response | response | 2 | const | 3372 | Using where |
| B | eq_ref | PRIMARY,respondent,response | PRIMARY | 4 | A.respondent,const | 1 | Using where; Using index |
+-------+--------+-----------------------------+----------+---------+--------------------+------+--------------------------+
I suppose next I'll try reversing the order of the index declarations, but is that really going to give me the boost needed to more than double the speed?
Bit arrays and bit masks sound interesting, but Googling "bit array" & "bit mask" doesn't give me much. Any good reads on this somewhere? | [reply] [d/l] [select] |
Found the PDL pages....interesting...
| [reply] |