Re^3: Creating Metadata from Text File

One other thing: you may want to apply the stop-word list to the query terms that someone submits when doing a search. You know these words are not in the index, so why waste time querying for them? (It might even serve as a form of instruction for the user: "based on what you entered, here are the words being used in the search: ...")
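Here is a minimal sketch of that idea in Perl. The stop-word set and the `filter_query` name are made up for illustration; in practice you would load the same stop-word file that was used when building the index.

```perl
use strict;
use warnings;

# Hypothetical stop list; a real one would be read from the stop-word file
# that was used to build the index.
my %is_stopword = map { $_ => 1 } qw(the a of and to in is);

# Lowercase the raw query, split it into words, and drop the stop words.
sub filter_query {
    my ($query) = @_;
    my @terms = grep { length } split /\W+/, lc $query;
    my @kept  = grep { !$is_stopword{$_} } @terms;
    return @kept;
}

my @used = filter_query("The history of water in the country");
print "Based on what you entered, here are the words being used in the search: @used\n";
```

Only the words that survive the filter would then be passed to the index query.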

Also, after you load the index table and you know how many docs are indexed (let's say it's 5000), you might want to try a query like:

SELECT count(doc_id),doc_word from doc_word_index group by doc_word
  order by count(doc_id) desc limit 20

If there are words that occur in all 5000 docs, you might as well add those to your stop list. (If the output of that particular query shows all 20 words with "5000", set the limit higher, to see how many words there are that occur in all documents.)

In fact, if you start out by indexing all words, you can build your own stop list this way, and it might be more effective than just assuming that someone else's list of "most frequent words" is appropriate for your particular set of docs. You might also decide that the threshold for inclusion in the stop list is something like "occurs in 90% of docs", as opposed to "occurs in all docs". (The "document frequency" of words -- how many docs contain a given word -- can be a useful metric for assigning weights to search terms when you get into ranking the "hits" according to "relevance".)
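To make the threshold idea concrete, here is a rough sketch in Perl. It assumes you have already pulled the per-word document counts out of the index table into a hash; the counts, the `build_stop_list` name, and the 0.9 cutoff are all illustrative, not taken from any real data.

```perl
use strict;
use warnings;

# Return (sorted) the words whose document frequency, as a fraction of all
# indexed docs, is at or above the given threshold.
sub build_stop_list {
    my ($doc_freq, $total_docs, $threshold) = @_;
    my @stop = grep { $doc_freq->{$_} / $total_docs >= $threshold }
               keys %$doc_freq;
    my @sorted = sort @stop;
    return @sorted;
}

# Made-up counts: word => number of docs containing it, out of 5000 docs.
my %doc_freq = (
    the    => 5000,
    and    => 4900,
    of     => 4600,
    water  => 1200,
    father => 450,
);

my @stop_list = build_stop_list(\%doc_freq, 5000, 0.9);
print "Stop list: @stop_list\n";
```

With these numbers, "water" and "father" stay in the index because they fall well below the 90% cutoff.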

Note that the "most frequent words" list you cited includes things like "number", "sound", "water", "air", "father", "mother", "country", etc., but these might occur in only some of your docs -- someone might have a valid expectation that they would be useful as search terms, and it would be wrong not to index them.

Re^4: Creating Metadata from Text File by Trihedralguy (Pilgrim) on Jul 23, 2007 at 12:44 UTC
I've never done anything like this before: `( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] ) or die "Usage: $0 stopword.list document.file\n";` What is going on here? When I run this code, it just gives me the die line. I don't understand.
Re^5: Creating Metadata from Text File by graff (Chancellor) on Jul 25, 2007 at 01:13 UTC
That's a way of conveying a "usage synopsis" to a user when the command is run without appropriate arguments. In this case, the "synopsis" is a message that says the user should enter the name of the command (i.e. the name of the perl script) followed by the name of a stopword list file, followed by the name of a document text file. The part between parentheses tests whether @ARGV contains exactly two elements, and then whether each element is the name of an existing plain file (that is what `-f` checks). If any of those three conditions is false, it goes into the "or die ..." clause, and the program exits with the usage synopsis. In case your shell environment requires that you run "perl.exe" followed by the name of your script, just add the names of the two data files after the name of the script in order to get it to actually run with those two files as input. Don't forget to redirect STDOUT to a file: `perl name_of_script.pl stopword.file doc.file > table.file`
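If the one-line form is hard to read, the same three tests can be spelled out one at a time. This is only an equivalent sketch; the `args_ok` helper is a name invented here, not part of the original script.

```perl
use strict;
use warnings;

# True only when exactly two arguments are given and both name existing
# plain files; these are the same three conditions as in the one-liner.
sub args_ok {
    my @args = @_;
    return 0 unless @args == 2;     # exactly two arguments?
    return 0 unless -f $args[0];    # first one is an existing plain file?
    return 0 unless -f $args[1];    # second one is an existing plain file?
    return 1;
}

# The real script would stop here with:
#   args_ok(@ARGV) or die "Usage: $0 stopword.list document.file\n";
# This sketch only prints the outcome, so it can be run without data files.
if ( args_ok(@ARGV) ) {
    print "Arguments look fine: @ARGV\n";
}
else {
    print "Usage: $0 stopword.list document.file\n";
}
```

Run with no arguments, it prints the usage line, which is what you saw; run with two existing file names, the check passes.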