in reply to (dws)Re: Search Engines for Dummies
in thread Search Engines for Dummies

Thank you both very much for your help.
Having your indexing process emit a .pl file you can require is an elegant way to load those arrays.

Thank you, that's smart, I can add that to the indexer easily. That file just reads:
@myarrayoffilenames = qw(

    the list of filenames

);
right? It doesn't need to be a fully-fledged file with a #! line and everything else?
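For what it's worth, a file pulled in with require doesn't need a #! line, but it does need to end with a true value, or require will die with "did not return a true value". A minimal sketch of what the indexer might emit (the filenames here are placeholders, not from the actual index):

```perl
# filelist.pl -- emitted by the indexer, loaded with: require 'filelist.pl';
@myarrayoffilenames = qw(
    node1.html
    node2.html
    node3.html
);

1;  # require() insists the file evaluate to a true value
```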

Consider using a set of "stop words" -- words that are so common that they're liable to occur in nearly every document (e.g., "a an are in it is the this")

I've already limited the index to words of four letters or more, with some exceptions. But you're quite right: there are lots of words that appear in every file, so I can cut major chunks out of the index by hand right now!
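Rather than pruning by hand each time, the indexer could skip stop words as it goes. A sketch, assuming the indexer examines one word at a time (the contents of %stop are just examples, not a definitive list):

```perl
# Skip words so common they'd appear in nearly every document.
my %stop = map { $_ => 1 } qw(
    a an and are in is it of the this that with
);

# Returns true if a word is worth indexing.
sub want_word {
    my $word = lc shift;
    return !$stop{$word};
}
```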

Replies are listed 'Best First'.
Re: Re: (dws)Re: Search Engines for Dummies
by dws (Chancellor) on Feb 26, 2001 at 21:53 UTC
    I've already limited the index to words of four letters or more,

    Well, there goes "sex" :)

    Seriously, a four-or-more letter rule isn't very good. You risk dropping significant two or three letter terms (e.g., AI, XML) while cluttering up the index with common words (e.g., were, which).

    Try this simple experiment. Sort your index by the length of each line. Terms that appear in all or nearly all of the documents will rise to the top. Then look at the first 100 or so words. If they're not "significant" (and here you'll have to make the call on what's significant to your domain), then add them to a list of words that the indexer will ignore.
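    Assuming the index file holds one term per line followed by the filenames it occurs in (the filename "index.txt" and that format are assumptions), the experiment above might be sketched as:

```perl
# Sort index lines longest-first: the longest lines belong to terms
# that appear in the most documents.
open my $fh, '<', 'index.txt' or die "Can't open index.txt: $!";
my @lines = <$fh>;
close $fh;

my @by_length = sort { length($b) <=> length($a) } @lines;

# Print the ~100 most widespread terms for manual review.
my $top = @by_length < 100 ? $#by_length : 99;
for my $line (@by_length[0 .. $top]) {
    my ($term) = split ' ', $line;
    print "$term\n";
}
```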