in reply to Database searches with persistent results via CGI

I have written a similar search to the one you describe, this is the approach I took.

1. build an indexer which takes all the words used in the search and put them into a table with these columns.

word, record_id, word_count, word_in_title

word == the word itself
record_id == where the word points to
word_count == frequency of word in document
word_in_title == does word exist in title

searching for terms than goes like this( notice that I only put the % at the end of the like so it will still use an index! )

select record_id, sum(word_count), max(word_in_title) from TABLE where word like '$word%' group by record_id

if the user searches for multiple words than run the above statement multiple times and sum them up by record_id.

Spit them back out by the max word_count.

My example site: ( does no caching of results, still very quick )

  • Comment on Re: Database searches with persistent results via CGI

Replies are listed 'Best First'.
Re: Re: Database searches with persistent results via CGI
by elwarren (Priest) on May 10, 2002 at 17:23 UTC
    This is exactly what I was about to suggest, I love PerlMonks!

    I would take it one step further and build the table ahead of time. Put all the names in this table and run your search against it. This way you limit the amount of data you chug through inthe search and then just follow the pointer to the bigger records you want to return. So instead of searching 1.5m rows x 20 columns, you'll be searching .90m rows x 2 columns.

    You could do this with a foreign key or you could do a split($name) and store the results in the first column. This would allow you to remove the LIKE and replace it with a regular match. When a user searches for 'JOHN%' the query would look kind of like this:
    select record_id from lookup where name = 'JOHN'
    and the table:
    NAME RECORD_ID ---- --------- JOHN 1 SMITH 1 BILL 2 SMITH 2 JOHN 3 DEER 3 JOHN 4 DOE 4
    and your result set would be 1,3,4. Say the user wanted all of the John Smiths out of the db, your query would look like:
    select record_id from lookup where name in ('JOHN', 'SMITH') group by record_id
    and your result would be 1, the group by only returns the one match because they are both the same record.

    Mucho faster than the big table. You can index both columsn and improve your speed again. If you're running Oracle then the name column is an ideal candidate for using a bitmap index. The record_id column would obviously perform better as your normal btree style index.

    Please please please do not create and drop temporary tables. If it's part of your design then make a table and insert and delete your records out of it. There is much more overhead in creating a table than there is inserting rows. Once you application is designed there should be no reason to be issuing DDL statements in production. Besides, we (the DBAs) hate going in and cleaning out temp tables because we need to figure why the developer left them all over and whether they still need them or not. If they're in your spec then they're exactly where they need to be.