ruhk has asked for the wisdom of the Perl Monks concerning the following question:

After much sitting here and scratching my head, I guess I will have to turn to my fellow monks for some help.

First of all, let me explain my project. I am writing a search engine for a database of Campus Newspaper articles dating from 1925 up through 1980. Right now the table is about 60,000 rows long, and it will most likely grow even bigger in the future.

About a week ago I completed the project and it worked pretty well, until I tried to search for a word that had more than a few thousand hits. Let me briefly explain how things are done right now.

My sql statement is:
my $searchstring = "SELECT * from Exponent WHERE ( Description LIKE '%$tempword%' OR Article LIKE '%$tempword%' ) AND ( Date > '$datefrom' )";

This pulls down everything that contains even the smallest match; then I have Perl loop over the results and weight them by the number of times the word appears in the article title and article description.
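
Simplified, the weighting part of my Perl looks something like this (not the exact code, just the idea; the column names are the ones from my SQL and the 2x weight is made up for the example):

my @results;
while ( my $row = $sth->fetchrow_hashref ) {
    # count how many times the word shows up in the two text columns
    my $desc_hits = () = $row->{Description} =~ /\Q$tempword\E/gi;
    my $body_hits = () = $row->{Article}     =~ /\Q$tempword\E/gi;

    # description matches count for more (the 2x weight is just an example)
    push @results, [ 2 * $desc_hits + $body_hits, $row ];
}

# best-weighted articles first
@results = sort { $b->[0] <=> $a->[0] } @results;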

This works fine until I try to pull down 30k rows. My question is this: is it possible in SQL to somehow write a function that takes 3 parameters (a pattern to match, plus a start slice and an end slice) so I only have to pull down the 10 or 20 matches I want to look at, not the whole thing?

I'm afraid I am still learning Perl, and my SQL is at an even lower level than my Perl understanding. Sorry for this rather long-winded question; I hope I make sense :P

Re: Inefficient search algorithm?
by Abigail-II (Bishop) on Jan 13, 2004 at 16:29 UTC
    Are you sure you want to do this in SQL? Taking 'slices' of your results isn't very SQL-like, and doing text searches in the WHERE clause isn't something SQL shines at either (since there are no constraints the engine can use).

    Considering the secret of time travel hasn't been discovered yet, there won't be any new Campus Newspaper articles printed between 1925 and 1980, nor will existing ones become unprinted. Your engine screams for a solution where the articles are preprocessed and indexed based on their content (you know, like Google and other search engines do it). That doesn't mean you don't store anything in a SQL database - in fact, you could still store all your data in a SQL database. But you wouldn't be doing full table scans, with full scans of text fields.

    Abigail

Re: Inefficient search algorithm?
by hardburn (Abbot) on Jan 13, 2004 at 16:32 UTC

    You're looking for the LIMIT clause in SQL. Now, because relational databases never guarantee that their results are returned in any particular order, you have to combine it with an ORDER BY clause so the results are sorted.

    Another note is that you should avoid the use of SELECT *. It's slow, and here again, relational databases don't guarantee that the columns will come back in any particular order (I wish more databases would pseudo-randomize the column order just to discourage this practice). You should instead specify the columns you want explicitly, e.g., SELECT Date, Description, Article FROM Exponent WHERE . . ..

    Finally, use placeholders. They're not too hard to work with, they're more secure, and they help statement caching. I have yet to see a good reason not to use them whenever possible. The DBI documentation and earlier Monastery nodes have some good examples of using them.
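
    Putting those points together, a sketch of what the code might look like (untested, and the connection details, column names, and the variables $tempword, $datefrom, $offset, and $per_page are guesses/placeholders on my part):

    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:newspaper', 'user', 'password',
                            { RaiseError => 1 } );

    # Explicit column list, placeholders for every value, ORDER BY + LIMIT for paging.
    my $sth = $dbh->prepare(q{
        SELECT Date, Description, Article
        FROM   Exponent
        WHERE  ( Description LIKE ? OR Article LIKE ? )
          AND  Date > ?
        ORDER BY Date DESC
        LIMIT  ?, ?
    });

    # The % wildcards go into the bound values, not the SQL text.
    # (Some drivers are fussy about placeholders inside LIMIT; if yours
    # complains, interpolate those two integers directly instead.)
    $sth->execute( "%$tempword%", "%$tempword%", $datefrom, $offset, $per_page );

    while ( my $row = $sth->fetchrow_hashref ) {
        print "$row->{Date}: $row->{Description}\n";
    }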

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

Re: Inefficient search algorithm?
by ViceRaid (Chaplain) on Jan 13, 2004 at 16:28 UTC

    Afternoon

    How to get a limited number of rows of results back varies from database to database. You didn't mention which database you're using, but in MySQL, for example, the syntax is:

    SELECT * FROM foo LIMIT 5, 10;

    This would fetch 10 rows, starting after the first 5 (i.e., rows 6 through 15). If you're not using MySQL, you'll need to have a quick search, as I can't remember the Oracle or Pg syntax off the top of my head.
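
    If it helps, the arithmetic for turning a page number into that first LIMIT argument is just this (variable names made up for the example):

    # Page 1 shows rows 1-10, page 2 shows rows 11-20, and so on.
    my $per_page = 10;
    my $page     = 3;                             # page the user asked for
    my $offset   = ( $page - 1 ) * $per_page;     # rows to skip first

    my $sql = "SELECT * FROM foo LIMIT $offset, $per_page";   # MySQL syntax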

    To make searching any reasonably big corpus of data acceptably fast, you probably need to make use of an index, which is a table of words and phrases matched to the documents that contain them. Looking through every word of every document for every search is just too sloooooow. Take a look at this recent article on perl.com for more info if you're interested in writing your own search engine from scratch in perl.

    Alternatively, since writing a search engine is a complicated (though very interesting) exercise, consider using a ready-to-go free search engine that you can install and use on your server, for example Swish-E (perl) or htDig (not perl). I've used the latter with fairly satisfying results.

    Lastly, have a search round the monastery, as this kind of thing has been discussed on a few occasions before.

    update: added answer to the obvious question at the top. stupid me.

    cheers
    ViceRaid

Re: Inefficient search algorithm?
by duff (Parson) on Jan 13, 2004 at 16:33 UTC

    You can use a LIMIT clause on your SELECT to limit the number of entries returned. Check the docs for whatever RDBMS you are using.

Re: Inefficient search algorithm?
by ruhk (Scribe) on Jan 13, 2004 at 17:43 UTC
    Ok, first another quick question. If I could order the results by the number of times the pattern was matched, it would be just what I was looking for. Something like this:

    SELECT * FROM foo WHERE Description LIKE '%$word%' ORDER BY "number of matches?" LIMIT $start, $stop

    I'll eventually use placeholders before this goes live; I read the link you gave me and it is indeed a good idea to use them :P

    I also read the search engine article and I must say, making it like that would be a LOT of work for such a small, little-used project. It looks fun to make, though. I'll only go down that road if I have to. Since this search wouldn't be used much, a cheap, simple, and slower solution will do. As long as it isn't toooo slow.

    What do you guys think?
      Yes, there is. The SQL gets a little squirrely and it takes longer to execute, but it shouldn't be too bad. You're also going to have to specify your columns for this. Something along the lines of:
      SELECT foo_count.num
            ,foo.first_col
            ,foo.second_col
      FROM   foo
            ,( SELECT description
                     ,COUNT(*) AS num
               FROM   foo
               WHERE  description LIKE ?
               GROUP BY description
             ) foo_count
      WHERE  foo.description = foo_count.description
      ORDER BY 1,2,3
      LIMIT  ?, ?

      This is untested, but should work. Essentially, you're using the results of a search as a table for your second search. It's advanced SQL, but not absolutely nuts.
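
      On the Perl side, running that with DBI would look roughly like this (untested sketch; assumes $dbh is already connected, $sql holds the statement above, and $word, $offset, and $rows_wanted come from the user):

      my $sth = $dbh->prepare($sql);

      # The % wildcards belong in the bound value, not in the SQL text.
      $sth->execute( "%$word%", $offset, $rows_wanted );

      while ( my ($num, $first_col, $second_col) = $sth->fetchrow_array ) {
          print "$num: $first_col / $second_col\n";
      }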

      ------
      We are the carpenters and bricklayers of the Information Age.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      There's a lot of nifty stuff that a real search engine will buy you, like doing searches based on root forms of words (e.g. search term "evicted" could match "evicts" in the database). But if you only have time for something relatively simple and crude, here's a sort of "off-the-cuff" idea (simple, crude...):

      Your collection of articles is a "corpus", with a set number of documents (articles) each of which has a set number of content words. Let's suppose you extract the full extent of this corpus out of your database and into a structured text file -- e.g. something like:

      <doc id="some_id_string" title="title string" date="date_string" ...>
      <text>
      This is the text of the article. Blah blah...
      </text>
      </doc>
      ...
      Now, suppose you define a new table for indexing the corpus content, with these columns:
      create table content_index (
          search_term  varchar(50),
          doc_id       varchar(30),
          in_title     char(1),
          in_body      char(1),
          how_many     integer
      )
      A perl script can read through the corpus and create a flat text file that you can load into this table. The perl script is easy (the one below is lightly tested, and I presume some monks will chastise me for not using an XML module... then again, you could just as well use some other method to format the text, or simply adapt this script to get what it needs directly from the database, rather than reading from a text file):
      #!/usr/bin/perl
      local $/ = "</doc>\n";
      while (<>) {    # reading from the corpus text file...
          my %tknhist = ();
          # get the docid and title:
          my ($id,$title) = (/id=\"(.*?)\" title=\"(.*?)\"/);
          # isolate the text
          my ($text) = ( m{<text>\s+(.*?)</text>}s );
          # downcase, remove punctuation, tokenize, count
          my $in = "ttl";
          for ( $title, $text ) {
              tr/A-Z'".,;:!?#&%$[]()0-9/a-z/d;    # everything from ' on is removed
              for my $tkn ( grep /\w{3,}/, split( /\s+/ )) {
                  $tknhist{$in}{$tkn}++;          # only count words >3 letters
              }
              $in = "bdy";
          }
          for my $tkn ( keys %{$tknhist{bdy}} ) {
              my $in = "N,Y";    # "not_in_title,in_body"
              if ( exists( $tknhist{ttl}{$tkn} )) {
                  $in =~ s/N/Y/;    # this token is in both places
                  $tknhist{bdy}{$tkn} += $tknhist{ttl}{$tkn};
                  delete $tknhist{ttl}{$tkn};
              }
              print join( ",", $tkn, $id, $in, $tknhist{bdy}{$tkn} ), "\n";
          }
          for my $tkn ( keys %{$tknhist{ttl}} ) {    # tokens in title only (if any)
              print join( ",", $tkn, $id, "Y,N", $tknhist{ttl}{$tkn} ), "\n";
          }
      }
      So that produces a listing where each line contains a "word,doc_id" tuple, together with information on whether the word was in the title, text body or both, and how many times total that word was in that doc.

      Now, you need to filter this flat-table text output a bit before you load it into your new index table. If a given "search term" shows up in all the docs -- or in some large quantity of them -- it's not much good as a search term, so it wouldn't make sense to have it in the "search term" indexing table.

      Any number of means can be used to determine the "document frequency" of each search term -- that is, how many docs contain the given term -- e.g. this unix shell command line:

      cut -f1 -d, table-data | sort | uniq -c | sort -nr > word.doc-freqs
      produces a listing with "#docs word", showing the most common words first. At the top of the list is some number of common English words that show up in every doc. Skip those, and continue skipping down the list until you reach a point where the terms start to look "informative" or "distinctive", and the document frequency amounts to less than some sensible percentage of the corpus (50% of the docs? more? less? I'm not sure...) You may also decide to eliminate or fix obvious misspellings (or not).

      Once you establish the cut-off point, filter your content-index data to remove all the words that are above the cut-off, and load the remainder into your new table. Now you can ask a user for search terms, and start by querying for those terms in this index table -- if a term isn't there, it's either too frequent, or non-existent (useless in either case). If all the terms provided by the user end up as no-shows, you'll have to ask for different terms.
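
      One way to do that filtering step in Perl (a rough sketch; the file name word.stoplist is made up here, and it's assumed to hold the too-frequent words, one per line, while table-data is the flat file produced by the indexing script above):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Words above the cut-off, one per line.
      open my $stop_fh, '<', 'word.stoplist' or die "word.stoplist: $!";
      my %stop = map { chomp; $_ => 1 } <$stop_fh>;
      close $stop_fh;

      # Keep only the content-index rows whose search term survived the cut.
      open my $data_fh, '<', 'table-data' or die "table-data: $!";
      while ( my $line = <$data_fh> ) {
          my ($term) = split /,/, $line;
          print $line unless $stop{$term};
      }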

      When the terms show up and yield a large number of rows (i.e. many docs containing the terms), you can alert the user about the size of the yield, and handle paging through the set as need be, because you know from the query results just how many docs matched. The query on this table can include an "order by how_many desc", which sorts the returned docids so that the ones with the most occurrences of the search term show up first. And you can limit the search by "title contains word_x" or "body contains word_x" by including tests for "in_title='Y'" (or 'N') and "in_body='Y'" (or 'N').
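
      So the paging query might come out something like this (a sketch assuming a MySQL-style LIMIT, an already-connected $dbh, and $term, $offset, and $per_page supplied by your paging code):

      my $sth = $dbh->prepare(q{
          SELECT doc_id, how_many
          FROM   content_index
          WHERE  search_term = ?
            AND  in_body     = 'Y'
          ORDER BY how_many DESC
          LIMIT  ?, ?
      });
      $sth->execute( $term, $offset, $per_page );

      while ( my ($doc_id, $how_many) = $sth->fetchrow_array ) {
          print "$doc_id ($how_many occurrences)\n";
      }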

      Just be sure that when you get search terms from the user, you treat them the same way as you did the tokens in the corpus -- convert to monocase, eliminate all punctuation, and ignore terms with fewer than 3 letters -- before you pass them into the query on the content_index table.
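
      Something along these lines, reusing the same tr/// as the indexing script (just a sketch):

      # Normalize a user-supplied search term the same way the indexer did.
      sub normalize_term {
          my ($term) = @_;
          $term =~ tr/A-Z'".,;:!?#&%$[]()0-9/a-z/d;   # downcase, strip punctuation and digits
          return length($term) >= 3 ? $term : undef;  # drop terms under 3 letters
      }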

      update: after looking at the OP again, I wanted to add that if you're clever about making up the doc_id strings -- fold the dates into them in a way that makes lexical sort the same as chronological sort -- you could handle date constraints as well as search terms when you query the content_index table ("doc_id > oldest_date_wanted", "doc_id < newest_date_wanted"). This might be considered "cheating" a bit, but I think it's justified in a "simple, crude" (quick-and-dirty) setting.
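
      For instance, a doc_id built like this sorts chronologically as a plain string (just an illustration; the field widths and the article number are arbitrary):

      # "19340217_0012" -- article 12 printed on Feb 17, 1934
      my ($year, $month, $day, $article_num) = (1934, 2, 17, 12);
      my $doc_id = sprintf "%04d%02d%02d_%04d", $year, $month, $day, $article_num;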

Re: Inefficient search algorithm?
by ruhk (Scribe) on Jan 13, 2004 at 16:50 UTC
    Thank you for the help so far.

    First of all, if I could make an indexing system that would be great :P However, this is just a small project for the University and I don't have many hours to spend on it, so I think it would be overkill to make something that complicated. I've only been working on it for a few days, and it's pretty much done except for the 'too many matches' problem. Not to mention, most of the work was converting the database from Access to Microsoft SQL.

    As for the slicing, I was thinking of having SQL order the results by number of hits; that way, if you click on "See results 20-29" I could just tell SQL to give me results 20-29 of the ordered set. I'm not sure if that is how LIMIT would work -- wouldn't LIMIT just give me a number of results less than the limit?

    I'll take a look at that article and the free search engines and see if I can pursue that end. However, is everyone pretty much sure I am asking to do the impossible in SQL?

    *edit: As for new articles, at any time Publications can hand us a new, updated Access database, and I have to write the tools that will auto-update the Microsoft SQL database, so my application actually has to allow for new articles to be published. I have to make the tools easy to use too, as the few people in the office that know Perl might be gone one day, since we are student employees.
Re: Inefficient search algorithm?
by maa (Pilgrim) on Jan 14, 2004 at 09:15 UTC

    Lots of databases offer an indexing mechanism for exactly this situation... they tend to do the opposite of what you do, though (sort of)... MySQL FullText Index Search - the page isn't valign-ed very well so the text is quite far down... this is probably what you want!
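
    In MySQL that looks roughly like the sketch below (column names taken from the original post; $dbh, $searchterm, $offset, and $per_page are assumed to exist already, and this is MySQL-only syntax -- if you're staying on Microsoft SQL Server you'd use its own full-text search feature instead):

    # One-time: add a full-text index over the searchable columns (MyISAM tables).
    $dbh->do(q{ ALTER TABLE Exponent ADD FULLTEXT (Description, Article) });

    # Then let the engine do the matching and the relevance ranking.
    my $sth = $dbh->prepare(q{
        SELECT  Date, Description, Article,
                MATCH (Description, Article) AGAINST (?) AS score
        FROM    Exponent
        WHERE   MATCH (Description, Article) AGAINST (?)
        ORDER BY score DESC
        LIMIT   ?, ?
    });
    $sth->execute( $searchterm, $searchterm, $offset, $per_page );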