in reply to Creating Metadata from Text File

So you're talking about building an index for a set of documents, and using a list of "stop words" so that only the "useful" words are indexed. Presumably, for each "useful" word, you want to keep track of all the documents that contain that word. As one of the other replies points out, a database server can be a good tool for this sort of thing, but the basic index list could start out as just a list of rows containing two fields: "doc_id usable_word", to indicate that a particular useful word was found in a particular document.
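If you do go the database route, the table itself can be that minimal -- something like this (a rough sketch in generic SQL; the table and column names just match the query used further down, and the column types are only a guess):

CREATE TABLE doc_word_index (
    doc_id   VARCHAR(255) NOT NULL,  -- document name or identifier
    doc_word VARCHAR(64)  NOT NULL   -- one "useful" word found in that document
);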

Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:

#!/usr/bin/perl
use strict;

( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
    or die "Usage: $0 stopword.list document.file\n";

my ( %stopwords, %docwords );
my ( $stopword_name, $document_name ) = @ARGV;

# Load the stop-word list into a hash for fast lookups.
open( I, "<", $stopword_name ) or die "$stopword_name: $!";
while (<I>) {
    my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
    $stopwords{$_} = undef for ( @words );
}
close I;

# Collect every lower-cased word in the document that is not a stop word.
open( I, "<", $document_name ) or die "$document_name: $!";
while (<I>) {
    for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
        $docwords{$_} = undef unless ( exists( $stopwords{$_} ));
    }
}
close I;

# Emit one "doc_id <tab> doc_word" row per distinct useful word.
for (keys %docwords) {
    print "$document_name\t$_\n";
}
If you run that on each document file and concatenate all the outputs together into a simple two-column table, you can then provide a search tool that uses a simple query like:
SELECT distinct(doc_id) from doc_word_index where doc_word = ?
When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses to that query. Other boolean queries ("this_word and that_word", etc.) can be set up easily as well.
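In Perl, that query-building step might look something like this (just a sketch: the DBI/SQLite connection details and the "index.db" file name are assumptions, not part of the original setup):

#!/usr/bin/perl
use strict;
use DBI;

# A sketch: assumes an SQLite file "index.db" holding the doc_word_index table.
my $dbh = DBI->connect( "dbi:SQLite:dbname=index.db", "", "", { RaiseError => 1 } );

my @terms = map { lc } @ARGV;    # search terms from the command line
die "Usage: $0 word [word ...]\n" unless @terms;

# One "doc_word = ?" clause per search term, OR'd together.
my $where = join " or ", ("doc_word = ?") x @terms;
my $sth = $dbh->prepare( "SELECT distinct(doc_id) from doc_word_index where $where" );
$sth->execute( @terms );

while ( my ($doc_id) = $sth->fetchrow_array ) {
    print "$doc_id\n";
}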

There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.
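For instance, a first cut at the "relevance" sort can be pushed into the query itself by counting how many of the search terms each doc matches (a sketch against the same two-column table, with three placeholders shown just as an example):

SELECT doc_id, count(distinct doc_word) as hits from doc_word_index where doc_word = ? or doc_word = ? or doc_word = ? group by doc_id order by hits desc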

(update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)
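With most database servers that is just something like the following (the index name here is arbitrary, and exact syntax varies a bit by server):

CREATE INDEX doc_word_idx ON doc_word_index (doc_word);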

Re^2: Creating Metadata from Text File
by Trihedralguy (Pilgrim) on Jul 21, 2007 at 01:55 UTC
    I love you, but now that I finally understand how to do this project, my weekend is ruined... I'll keep you posted!! :)
      One other thing: you may want to apply the stop-word list to the query terms that someone submits when doing a search. You know these words are not in the index, so why waste time querying for them? (It might even serve as a form of instruction for the user: "based on what you entered, here are the words being used in the search: ...")
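      Something along these lines would do it (a sketch, reusing the %stopwords hash from the indexing script above; $user_input stands for whatever the search form hands you):

      # Drop any stop words from the user's query before building the SQL.
      my @query_terms = grep { not exists $stopwords{$_} }
                        grep /^[a-z]+$/, map { lc() } split /\W+/, $user_input;
      print "Searching on: @query_terms\n";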

      Also, after you load the index table and you know how many docs are indexed (let's say it's 5000), you might want to try a query like:

      SELECT count(doc_id),doc_word from doc_word_index group by doc_word order by count(doc_id) desc limit 20
      If there are words that occur in all 5000 docs, you might as well add those to your stop list. (If the output of that particular query shows all 20 words with "5000", set the limit higher, to see how many words there are that occur in all documents.)

      In fact, if you start out by indexing all words, you can build your own stop list this way, and it might be more effective than just assuming that someone else's list of "most frequent words" is appropriate for your particular set of docs. You might also decide that the threshold for inclusion in the stop list is something like "occurs in 90% of docs", as opposed to "occurs in all docs". (The "document frequency" of words -- how many docs contain a given word -- can be a useful metric for assigning weights to search terms when you get into ranking the "hits" according to "relevance".)
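      That kind of threshold can be checked with one more query (a sketch; 4500 here is just 90% of the 5000-doc example above):

      SELECT doc_word from doc_word_index group by doc_word having count(distinct doc_id) >= 4500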

      Note that the "most frequent words" list you cited includes things like "number", "sound", "water", "air", "father", "mother", "country", etc., but these might occur in only some of your docs -- someone might have a valid expectation that they would be useful as search terms, and it would be wrong not to index them.

        I've never done anything like this before:
        ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] ) or die "Usage: $0 stopword.list document.file\n";

        What is going on here? When I run this code, it just gives me the die line. I don't understand.