Re^4: Does Search::InvertedIndex module live up to billing?

Perrin, you're welcome to be harsh - I'm pretty thick-skinned and would rather learn than be coddled.

And I really appreciate the time put into that response, Corrin. But I suspect I've not expressed myself correctly.

I have been given a datafile that contains two pieces of information for each respondent: respondent id, and a list of response id's that respondent chose:

respondent_1: 123, 456, 789, 159, 753, ...
respondent_2: 753, 654, 987, 321, 963, ...
..etc...

Respondent id's go from 1 to 20,000. Response id's go from 1 to 3,000. On average each respondent would have chosen about 1/5 of possible responses, so the average respondent's record is ~600 response id's long. (Note that that means the average response was chosen by about 1/5 of respondents.) That's what I've got to work with.
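
As an illustration of the layout just described, here is a minimal Perl sketch that reads such records into an in-memory inverted index (response id to the list of respondents who chose it). The three sample lines, the `respondent_N:` prefix pattern, and the name `%respondents_for` are assumptions made for the example; the real file's exact syntax may differ.

```perl
use strict;
use warnings;

# Hypothetical sample records in the layout described above
# ("respondent_N: id, id, id"); the real file may be formatted differently.
my @lines = (
    'respondent_1: 123, 456, 789, 159, 753',
    'respondent_2: 753, 654, 987, 321, 963',
    'respondent_3: 123, 753',
);

# Inverted index: response id => list of respondent ids who chose it.
my %respondents_for;

for my $line (@lines) {
    # Split "respondent_7: 1, 2, 3" into the numeric id and the id list.
    my ($respondent, $list) = $line =~ /^respondent_(\d+):\s*(.*)$/
        or next;    # skip anything that does not look like a record
    for my $response ($list =~ /(\d+)/g) {
        push @{ $respondents_for{$response} }, $respondent;
    }
}

# Show each response id with the respondents who chose it.
printf "response %d chosen by: %s\n", $_, join(', ', @{ $respondents_for{$_} })
    for sort { $a <=> $b } keys %respondents_for;
```

With real data the loop would read from a file handle instead of `@lines`; the index-building step stays the same.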

The objective is to generate reports for:

count respondents who gave response x and response y

where it may be just x and y, or it may be x and y and z and w as a percentage of respondents who gave responses a and b and c, and a single report is made of a few hundred such queries. These reports are being generated on the fly (too many millions of combinations and permutations to pre-generate them), so query speed is paramount. The data never changes once loaded, so updates, inserts, deletions, etc are of no concern.

So, I don't have the information to create these other tables. I might be able to get it, but I'd like to do the best with what I've got before asking for information I know will be hard to acquire.

I went to the full 3,000 columns with the idea of making a column for each possible response id, making columns boolean tinyint(1), then doing

select count(*) from theTable where (response_id_x=1 and response_id_y=1);

With the respondent column as primary key, it's averaging ~0.25 seconds per query today (using Benchmark module to measure). And, as I think we all concur, 3,000 indexes is not practicable.

I've also tried a two-column structure:

respondent_id, response_id

with a compound primary key. This means about 600 x 20,000 = 12-million rows. Doing queries of the form

select count(*) from theTable A inner join theTable B on (A.respondent = B.respondent) where (A.response_id = x and B.response_id = y);

on this gives me ~0.15 sec per query execution. I tried moving the join out to Perl by doing two separate queries on the response id's and then finding the intersection of the two sets, but tests eventually showed no speed difference between the two approaches once the code was optimized.
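
For reference, a minimal sketch of what such a Perl-side intersection might look like. This is not the code that was benchmarked; the function name `count_intersection` and the sample id lists are hypothetical, and in practice each list would come from a single-response query (for example via DBI's `selectcol_arrayref`).

```perl
use strict;
use warnings;

# Count the ids common to two lists, e.g. the respondent ids returned by
# two single-response queries. Assumes neither list contains duplicates.
sub count_intersection {
    my ($ids_x, $ids_y) = @_;

    # Hash the shorter list so the lookup table stays as small as possible.
    ($ids_x, $ids_y) = ($ids_y, $ids_x) if @$ids_x > @$ids_y;
    my %seen;
    @seen{@$ids_x} = ();

    # grep in scalar context returns the number of matching elements.
    return scalar grep { exists $seen{$_} } @$ids_y;
}

# Hypothetical respondent ids for responses x and y.
my @gave_x = (1, 4, 7, 9, 12);
my @gave_y = (2, 4, 9, 15);

print count_intersection(\@gave_x, \@gave_y), "\n";    # prints 2
```

The cost is one hash build plus one pass over the longer list, so the work grows with the combined length of the two lists.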

Letting my experimentation wander into the ridiculous, I also tried a unique inverted index: one single-column table for each response_id, with that column as primary key, filled with the respondent_id's of those who had given that response. That means each table is going to average 20,000/5 =~ 4,000 rows to join. Thus, the query became:

SELECT COUNT(*) FROM table_x JOIN table_y ON (table_x.respondent = table_y.respondent)

and this got me down to ~0.07 sec per query execution. Not bad speed-wise, but bad design, I know. In this case, moving the join out to a Perl intersection routine was measurably slower.

Because I know that's bad design, I'm looking for alternatives within the limited framework of my problem that are better design but as fast or faster. Hence my drifting into thinking of other inverted index possibilities.
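
One inverted-index variant that fits these constraints (static data, ids in a small dense range) is an in-memory bit vector per response. The sketch below is an assumption-laden illustration, not something tried in this thread: the toy data and the names `%bits_for` and `count_all` are hypothetical. With 20,000 respondents each vector would be about 2.5 KB, so roughly 7.5 MB for all 3,000 responses. Whether it beats the SQL timings above would have to be benchmarked.

```perl
use strict;
use warnings;

# Hypothetical toy data: response id => respondent ids who chose it.
my %respondents_for = (
    10 => [1, 2, 3, 5, 8],
    20 => [2, 3, 5, 7],
    30 => [3, 5, 8, 9],
);

# One bit string per response; bit N is set when respondent N chose it.
my %bits_for;
for my $response (keys %respondents_for) {
    my $bits = '';
    vec($bits, $_, 1) = 1 for @{ $respondents_for{$response} };
    $bits_for{$response} = $bits;
}

# Count respondents who gave every response in the argument list.
sub count_all {
    my @responses = @_;
    my $acc = $bits_for{ shift @responses };
    # String-wise AND; the result is as long as the shorter operand.
    $acc &= $bits_for{$_} for @responses;
    # "%32b*" is the checksum form of unpack: it sums the bits, which
    # counts the respondents still present after the ANDs.
    return unpack '%32b*', $acc;
}

my $base  = count_all(10, 20);        # respondents 2, 3, 5
my $match = count_all(10, 20, 30);    # respondents 3, 5
printf "%d of %d (%.1f%%)\n", $match, $base, 100 * $match / $base;
```

The last three lines show the "x and y and z as a percentage of x and y" style of report: two calls to `count_all` and a division.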

By the way - I'm posting this here at PM 'cause the whole thing is built on Perl DBI.


Replies are listed 'Best First'.
Re^5: Does Search::InvertedIndex module live up to billing? by perrin (Chancellor) on Oct 21, 2004 at 20:22 UTC
Your compound primary key approach is the right one. Can you show the DDL you used to create the table and set up indexes? Did you try running the query analyzer to see the query plan and make sure your indexes were being used?
Re^6: Does Search::InvertedIndex module live up to billing? by punch_card_don (Curate) on Oct 21, 2004 at 20:49 UTC
Sure thing:

CREATE TABLE `theTable` (
  `response_id` smallint(5) unsigned NOT NULL default '0',
  `respondent_id` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY (`response_id`,`respondent_id`)
) TYPE=MyISAM

I'm afraid I'm not familiar with this infamous query analyzer - so I'll read up on it on the web.
Re^7: Does Search::InvertedIndex module live up to billing? by perrin (Chancellor) on Oct 21, 2004 at 21:26 UTC
You are joining on respondent_id and searching on response_id, so you need indexes on both of those.

KEY (response_id), KEY (respondent_id)
Re^8: Does Search::InvertedIndex module live up to billing? by punch_card_don (Curate) on Oct 21, 2004 at 21:31 UTC
Re^9: Does Search::InvertedIndex module live up to billing? by perrin (Chancellor) on Oct 21, 2004 at 22:26 UTC
Re^7: Does Search::InvertedIndex module live up to billing? by punch_card_don (Curate) on Oct 21, 2004 at 21:26 UTC
Well, I found and tried the EXPLAIN command on the two-column query. Here's what it gave me:

+-------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| table | type   | possible_keys | key     | key_len | ref                | rows | Extra                    |
+-------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| B     | ref    | PRIMARY       | PRIMARY | 2       | const              | 3007 | Using where; Using index |
| A     | eq_ref | PRIMARY       | PRIMARY | 4       | const,B.respondent | 1    | Using where; Using index |
+-------+--------+---------------+---------+---------+--------------------+------+--------------------------+

Unfortunately I have no idea what this means. More reading...

Edit by castaway - swapped pre tags for code tags
Re^5: Does Search::InvertedIndex module live up to billing? by Roy Johnson (Monsignor) on Oct 21, 2004 at 21:12 UTC
For the two-column solution, in which order did you define the columns of your Primary Key? Did you try it with response_id first?

Caution: Contents may have been coded under pressure.