in reply to Does Search::InvertedIndex module live up to billing?

Actually, it's not very clear to me why you would want an inverted index for this. Inverted indexes are good for searching data that is NOT already organized into a structure. Typically, they are used for full-text searching. To answer questions like "how many respondents chose answer C on question 6?" you would be better off with a standard database. MySQL can handle tables with many millions of rows, so your data set doesn't sound like a problem to me at all.
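To make that concrete, here is a minimal sketch of the kind of query a standard database handles well, assuming a hypothetical normalized table `answers` with one row per (respondent, question, answer):

```sql
-- Hypothetical table: answers(respondent_id, question, answer)
-- With an index on (question, answer) this is a single index lookup.
SELECT COUNT(*)
  FROM answers
 WHERE question = 6
   AND answer = 'C';
```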
  • Comment on Re: Does Search::InvertedIndex module live up to billing?

Replies are listed 'Best First'.
Re^2: Does Search::InvertedIndex module live up to billing?
by punch_card_don (Curate) on Oct 21, 2004 at 16:34 UTC
    Well, I've tried the 3,000 column by 20,000 row thing.

    respondent_id, response_1, response_2, .....

    Searches of the form

    select count(*) from theTable where (response_1=1 and response_n=1);

    perform abysmally, averaging ~0.24 seconds per select execution. I'm looking for under 0.1 sec because I'm producing reports in real time that require a few hundred selects to produce a single report.

    I can use respondent_id as primary key, but each of the 3,000 columns is equally likely to be searched on and I can't index them all. I don't believe MySQL will allow 3,000 indexes on a single table.

    So .... I got thinking about an inverted index so I could quickly retrieve the list of respondents who gave response x, the list of respondents who gave response y, and then compute their intersection.

    My first version of this was to have a separate table for each possible response and populate it with respondent id's. That improved performance to ~0.16 sec average. Of course, each table has its own index. If, instead, I had one large inverted index,

    response_id, respondent_1, respondent_2, ...

    I would only need to use the one response_id column as primary key, reducing total size and complexity of the database. In that case I have two options:

    • put respondent id's in the columns and my selects become
      select * from theTable where response_id = x
    • use the response id's as column names and fill with 1s and 0s, but then I need a way to associate matches with their column names.....
    Soooo, I was thinking of trying the search engine trick and building a reverse index like Search::InvertedIndex.

    I'm very interested in ideas.

    Thanks.

      I'm not sure I understand your database schema. From how I understood it, I created the following SQL schema (in SQLite syntax), which is not denormalized, so it might be suboptimal, but it is as clean as I can imagine it. By "fk" I mean "foreign key", so offered_choice.correct should be a value that can be found in the response.id column.

      create table response ( -- models a possible response
          id integer primary key,
          text varchar
      );
      create table question ( -- models a possible question
          id integer primary key,
          text varchar
      );
      create table offered_choice ( -- models the offered choices to a question
          id integer primary key,
          question integer,
          choice integer -- fk into response
      );
      create table questionnaire (
          id integer primary key,
          position integer,
          choice_set integer, -- fk into offered_choice
          correct integer -- fk into response
      );
      create table answer ( -- models a given answer
          id integer primary key,
          subject integer,
          questionlet integer,
          response integer
      );
      create table subject ( -- models a person taking a test
          id integer primary key,
          name varchar
      );

      With these tables filled, and indices at least on every primary key column, your queries become joins over multiple tables, but they should also not be too slow.
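      As a sketch, and with the assumption that `answer.questionlet` is a foreign key into `questionnaire.id`, a question like "how many subjects gave response 42 to the question at position 6" becomes a two-table join (the ids 6 and 42 are placeholders):

```sql
-- Sketch only: table and column names follow the schema above.
SELECT COUNT(*)
  FROM answer a
  JOIN questionnaire q ON a.questionlet = q.id
 WHERE q.position = 6
   AND a.response = 42;
```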

      Your dataset doesn't seem too large, though, so I wonder what you actually did to optimize MySQL performance. MySQL can hold a (read-only) table completely in memory, so you should tell it to do just that; this alone should speed up your queries quite a lot. On the other hand, I still wonder about your current database schema, because blindly throwing around 3,000 tables or 3,000 indices doesn't make much sense to me. I'm also not sure whether every subject was asked every question and was offered every possible response - my schema allows for mixing the responses offered with each question, but also allows for only one "correct" response to every questionlet.
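      A minimal sketch of holding a read-only copy in memory, assuming the table name `theTable` from earlier in the thread (TYPE=HEAP is the older MySQL spelling of the MEMORY engine; such tables vanish on server restart, and CREATE ... SELECT does not copy indexes, so re-add the primary key afterwards):

```sql
-- Copy the static data into an in-memory table once at startup.
CREATE TABLE theTable_mem TYPE=HEAP
    SELECT * FROM theTable;
```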

      I don't want to sound too harsh, but you really need to do some reading on relational database schema design. There is no problem I can think of that would require 3,000 columns to represent it. Something like Corion's design is what I would have done; with proper indexes, it should handle your relatively small dataset very efficiently.
        Perrin, you're welcome to be harsh - I'm pretty thick skinned and prefer to learn rather than be coddled.

        And I really appreciate the time put into that response, Corion. But I suspect I've not expressed myself clearly.

        I have been given a datafile that contains two pieces of information for each respondent: respondent id, and a list of response id's that respondent chose:

        respondent_1: 123, 456, 789, 159, 753, ...
        respondent_2: 753, 654, 987, 321, 963, ...
        ...etc...
        Respondent id's go from 1 to 20,000. Response id's go from 1 to 3,000. On average each respondent would have chosen about 1/5 of possible responses, so the average respondent's record is ~600 response id's long. (Note that that means the average response was chosen by about 1/5 of respondents.) That's what I've got to work with.

        The objective is to generate reports for:

        count respondents who gave response x and response y
        where it may be just x and y, or it may be x and y and z and w as a percentage of respondents who gave responses a and b and c. A single report is made up of a few hundred such queries. These reports are generated on the fly (there are too many millions of combinations and permutations to pre-generate them), so query speed is paramount. The data never changes once loaded, so updates, inserts, deletions, etc. are of no concern.
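        One way to sketch such a ratio, assuming a two-column (respondent_id, response_id) table named theTable, is two counting queries whose results are divided afterwards (the ids 101, 202, 301, etc. are placeholders):

```sql
-- Numerator: respondents who gave both response x and response y.
SELECT COUNT(DISTINCT a.respondent_id)
  FROM theTable a
  JOIN theTable b ON a.respondent_id = b.respondent_id
 WHERE a.response_id = 101   -- x (placeholder)
   AND b.response_id = 202;  -- y (placeholder)

-- Denominator: respondents who gave responses a, b and c.
SELECT COUNT(DISTINCT a.respondent_id)
  FROM theTable a
  JOIN theTable b ON a.respondent_id = b.respondent_id
  JOIN theTable c ON a.respondent_id = c.respondent_id
 WHERE a.response_id = 301
   AND b.response_id = 302
   AND c.response_id = 303;
```

        The percentage is then the first count divided by the second, computed in Perl.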

        So, I don't have the information to create these other tables. I might be able to get it, but I'd like to do the best with what I've got before asking for information I know will be hard to acquire.

        I went to the full 3,000 columns with the idea of making a column for each possible response id, making columns boolean tinyint(1), then doing

        select count(*) from theTable where (response_id_x=1 and response_id_y=1);
        With the respondent column as primary key, it's averaging ~0.25 seconds per query today (using Benchmark module to measure). And, as I think we all concur, 3,000 indexes is not practicable.

        I've also tried a two-column structure:

        respondent_id, response_id
        with a compound primary key. This means about 600 x 20,000 = 12-million rows. Doing queries of the form
        select count(*) from theTable A inner join theTable B on (A.respondent = B.respondent) where (A.response_id = x and B.response_id = y);
        on this gives me ~0.15 sec per query execution. I tried moving the join out to Perl by doing two separate queries on the response id's and then finding the intersection of the two sets, but tests eventually showed no speed difference between the two with optimized code.
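        For reference, the Perl-side intersection can be as simple as one hash and one grep. This is a sketch (the sample ids are made up), with the two lists standing in for the results of two single-response SELECTs fetched with DBI's selectcol_arrayref:

```perl
use strict;
use warnings;

# Count the respondents common to two lists of ids, e.g. the
# results of two separate SELECT respondent_id queries.
# Hash lookup makes this O(n + m) over the two list lengths.
sub intersection_count {
    my ($ids_a, $ids_b) = @_;
    my %seen = map { $_ => 1 } @$ids_a;
    return scalar grep { $seen{$_} } @$ids_b;
}

my @gave_x = (101, 102, 105, 110);    # made-up sample ids
my @gave_y = (102, 103, 110, 111);
print intersection_count(\@gave_x, \@gave_y), "\n";    # prints 2
```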

        Letting my experimentation wander into the ridiculous, I also tried a unique inverted index, using one single-column table for each response_id, using that column as primary key, and filling it with the respondent_id's who had given that response. That means each table is going to average 20,000/5 =~ 4,000 rows to join. Thus, the query became:

        SELECT COUNT(*) FROM table_x JOIN table_y ON (table_x.respondent = table_y.respondent);
        and this got me down to ~0.07 sec per query execution. Not bad speed-wise, but bad design, I know. In this case, moving the join out to a Perl intersection routine was measurably slower.

        Because I know that's bad design, I'm looking for alternatives within the limited framework of my problem that are better design but as fast or faster. Hence my drifting into thinking of other inverted index possibilities.

        By the way - I'm posting this here at PM 'cause the whole thing is built on Perl DBI.
