in reply to Search Engines for Dummies

The benefit of using a database, either an RDBMS or DBM, is that the question "which files use this word" can be answered in a relatively small number of disk reads compared with scanning a large flat file.

With a flat file, you have to read until you find the word (or determine it isn't there). On average, you'll have to read half of the file. Worst case, the entire file. Assuming a 400Kb file is read in 8K chunks (by underlying OS buffering), that's 25 reads on average; 50 worst case.
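To make that concrete, here's a minimal sketch of the linear scan, assuming one "word file1 file2 ..." line per term (which may differ from your actual format):

    # Linear scan of a flat-file index: read until the term turns up (or doesn't).
    my $term = 'camel';                        # hypothetical search term
    open(IDX, "index.txt") or die "index.txt: $!";
    while (<IDX>) {
        my ($word, @files) = split;
        if ($word eq $term) {                  # found it -- about halfway, on average
            print "$term appears in files: @files\n";
            last;
        }
    }
    close IDX;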

With a database, you'll incur a small number of disk reads to traverse the index, and then a single read to pick up the rest of the record (file numbers, in your example).

If the search expression contains a number of words, at some point it might be faster to scan a flat file (assuming you're going to search for all terms in a single pass). You may need to determine the break-even point by experimenting.

Another issue to consider in choosing between a flatfile-based and a database-based implementation is ease of update, particularly if you're going to attempt incremental updates. It's going to be faster to make a single pass over a flatfile (writing an updated flatfile to replace it) than it is to scan the database record-by-record.

With that as background: let's consider your questions.

Q: Is there something inherently better about using a database?
A: "It depends", but the answer is probably YES.

Q: What's the DBM format, and how does it compare to SQL?
A: DBM is a simple on-disk key/value store, which Perl presents as a tied hash; an SQL database layers tables and a query language on top. Both separate indexes from record data, allowing a "does this key exist" search to be done with a small number of disk accesses.
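As a rough sketch, a DBM lookup through a tied hash might look like this (DB_File is just one of several DBM modules, and the value format -- a space-separated string of file numbers -- is an assumption):

    use DB_File;
    use Fcntl;

    my %index;
    tie %index, 'DB_File', 'index.db', O_RDONLY, 0644, $DB_HASH
        or die "Can't open index.db: $!";

    my $term = 'camel';                            # hypothetical search term
    print "$term appears in files: $index{$term}\n"
        if exists $index{$term};                   # e.g. the string "3 17 42"
    untie %index;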

Q: Is DBM common?
A: Yes, but some (older, poor) implementations limit the record size to 1K. Check this on your system.

Q: Where can I go to learn about Perl talking to SQL?
A: There are a number of good books and on-line articles that talk about using Perl's DBI interface (DBI.pm) to talk to databases. Take a look at O'Reilly's new ONLAMP site, or search for "Perl DBD" on Google (http://www.google.com/).
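To give a flavor of DBI, a minimal query might look like the following; the DSN, table, and column names here are made up for illustration:

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:search', 'user', 'password',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare('SELECT file_id FROM word_index WHERE word = ?');
    $sth->execute('camel');                        # hypothetical search term

    while (my ($file_id) = $sth->fetchrow_array) {
        print "found in file $file_id\n";
    }
    $dbh->disconnect;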

Q: Should I be opening my 400Kb file for every search term?
A: No. Write some extra code so that you can do the search in one pass.
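For instance (again assuming "word file1 file2 ..." lines), you could collect the postings for every search term in one scan, something like:

    my @terms  = qw(perl search engine);           # hypothetical query
    my %wanted = map { $_ => 1 } @terms;
    my %hits;                                      # term => [ file numbers ]

    open(IDX, "index.txt") or die "index.txt: $!";
    while (<IDX>) {
        my ($word, @files) = split;
        $hits{$word} = [@files] if $wanted{$word};
        last if keys %hits == keys %wanted;        # stop once every term is found
    }
    close IDX;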

Q: What about RAM?
A: If you're scanning the file line by line, RAM shouldn't be an issue for you. Perl buffers a disk block at a time and extracts lines from it; each line is read into $_, which gets overwritten every time you read the next one.

Q: What about my data structures?
A: At first blush, they seem O.K. Having URLs and document titles available as arrays isn't going to significantly increase your script's startup time, and having your indexing process emit a .pl file you can require is an elegant way to load those arrays.
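A sketch of the emitting side, with made-up names and values (real titles may need their quotes escaped), might look like:

    my @urls   = ('http://example.com/a.html', 'http://example.com/b.html');
    my @titles = ('Page A', 'Page B');

    open(OUT, "> arrays.pl") or die "arrays.pl: $!";
    print OUT "\@urls = (\n",   (map { "  '$_',\n" } @urls),   ");\n";
    print OUT "\@titles = (\n", (map { "  '$_',\n" } @titles), ");\n";
    print OUT "1;\n";      # a file pulled in via require() must end in a true value
    close OUT;

    # The search script then just does:
    #   require "arrays.pl";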

Let's pretend you also asked...

Q: Is there anything else I can do to save space or time?
A: Consider using a set of "stop words" -- words that are so common that they're liable to occur in nearly every document (e.g., "a an are in it is the this") and don't index those words. In the unlikely event that someone searches only on one of those terms, you can either verbally abuse them or do a linear search through your documents. You can probably cut your index size in half.
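In the indexer, a stop-word filter can be as simple as a hash lookup; something along these lines, where the word list and the tokenizing loop are only illustrative:

    my %stop = map { $_ => 1 }
               qw(a an and are as at be by for in is it of on or the this to was were);

    foreach my $word (split /\W+/, $text) {   # $text stands in for a document's contents
        next if $stop{lc $word};              # skip the stop words
        # ... add $word to the index as before ...
    }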

P.S. Good first post. Welcome to the Monastery.

Re: (dws)Re: Search Engines for Dummies
by hostile17 (Novice) on Feb 26, 2001 at 11:27 UTC
    Thank you both very much for your help.
    having your indexing process emit a .pl file you can require is an elegant way to load those arrays.

    Thank you, that's smart, I can add that to the indexer easily. That file just reads:
    @myarrayoffilenames = qw(
        the list of filenames
    );
    right? It doesn't need to be a fully-fledged file with a #! line and everything else?

    Consider using a set of "stop words" -- words that are so common that they're liable to occur in nearly every document (e.g., "a an are in it is the this")

    I've already limited the index to words of four letters or more, with some exceptions, but you're quite right, there are lots of words which are in every file so I can cut major chunks out of the index, by hand, right now!
      I've already limited the index to words of four letters or more,

      Well, there goes "sex" :)

      Seriously, a four-or-more-letter rule isn't very good. You risk dropping significant two- or three-letter terms (e.g., AI, XML) while still cluttering up the index with common words (e.g., were, which).

      Try this simple experiment. Sort your index by the length of each line. Terms that appear in all or nearly all of the documents will rise to the top. Then look at the first 100 or so words. If they're not "significant" (and here you'll have to make the call on what's significant to your domain), then add them to a list of words that the indexer will ignore.
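      If your index has one line per word, a quick-and-dirty way to do that sort is a one-liner along these lines (the file name is assumed):

          perl -e 'print sort { length($b) <=> length($a) } <>' index.txt | head -100

      The longest lines -- the terms that hit the most documents -- come out first.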

Re: (dws)Re: Search Engines for Dummies
by Danilo (Novice) on Feb 26, 2001 at 20:17 UTC
    Assuming a 400Kb file is read in 8K chunks (by underlying OS buffering), that's 25 reads on average; 50 worst case.

    I'm also assuming that you are reading the file from the beginning. Why not try doing a binary search and start in the middle? I'm not sure of the big-O notation for this, but it would reduce the number of reads if you stick with the flat file.
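    A rough sketch of what I mean, assuming the index is sorted and uses "word file1 file2 ..." lines: seek to the middle, resync to a line boundary, narrow the window with a few seeks, then scan the last little stretch linearly. That's a handful of seeks instead of dozens of reads.

        sub find_term {
            my ($file, $term) = @_;
            open(my $fh, '<', $file) or die "$file: $!";
            my ($lo, $hi) = (0, -s $file);

            while ($hi - $lo > 1024) {          # narrow the window with seeks
                my $mid = int(($lo + $hi) / 2);
                seek($fh, $mid, 0);
                <$fh>;                          # discard the partial line we landed in
                my $line = <$fh>;
                last unless defined $line;      # near EOF; just scan from $lo
                my ($word) = split ' ', $line;
                if ($word le $term) { $lo = $mid } else { $hi = $mid }
            }

            seek($fh, $lo, 0);
            <$fh> if $lo > 0;                   # resync to the next line boundary
            while (my $line = <$fh>) {
                my ($word, @files) = split ' ', $line;
                return @files if $word eq $term;
                last if $word gt $term;         # sorted, so we've passed it
            }
            return;                             # not found
        }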

    update: You could also divide up the file by the frequency of the letters that start words in the language of choice (English?). Or start the flat file with a listing of how many words start with a, b, etc., then compare that against the search word, skip that many lines into the file, and maybe do a binary search within a section if a lot of words start with that particular letter.
      You could have the indexer remember the starting position of each letter in the alphabet, and emit a separate require-ready file of these positions. That would indeed cut the search down. Or you could sort the index by word length, and remember those starting positions.
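      A sketch of that first idea (file names and the emitted format are just illustrative): have the indexer note the byte offset where each initial letter begins, and write those out as another require-able file.

          my %offset;
          open(IDX, "index.txt") or die "index.txt: $!";
          my $pos = 0;
          while (my $line = <IDX>) {
              my ($letter) = $line =~ /^(.)/;
              $offset{lc $letter} = $pos unless exists $offset{lc $letter};
              $pos = tell(IDX);                 # start of the next line
          }
          close IDX;

          open(OUT, "> offsets.pl") or die "offsets.pl: $!";
          print OUT "%offset = (\n",
                    (map { "  '$_' => $offset{$_},\n" } sort keys %offset),
                    ");\n1;\n";
          close OUT;

      The searcher can then require "offsets.pl", seek to $offset{$first_letter}, and scan (or binary search) just that letter's section.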

      Note that in either case, you're building an index, and have set one foot on the slippery slope towards using a DBM or SQL database. Why not go all the way?

        I think I would try using a db just for the exercise, but I was under the impression that the size of his file was small enough that implementing a db would add more overhead than it's worth, and might not justify the effort of learning a db module.