
So you're talking about building an index for a set of documents, and using a list of "stop words" so that only the "useful" words are indexed. Presumably, for each "useful" word, you want to keep track of all the documents that contain that word. As one of the other replies points out, a database server can be a good tool for this sort of thing, but the basic index list could start out as just a list of rows containing two fields: "doc_id usable_word", to indicate that a particular useful word was found in a particular document.

Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:

#!/usr/bin/perl

use strict;
use warnings;

( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
    or die "Usage:  $0  stopword.list  document.file\n";

my ( %stopwords, %docwords );
my ( $stopword_name, $document_name ) = @ARGV;

# Load the stop words, lowercased, keeping only purely alphabetic tokens.
open( my $stop_fh, "<", $stopword_name ) or die "$stopword_name: $!";
while (<$stop_fh>) {
    my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
    $stopwords{$_} = undef for ( @words );
}
close $stop_fh;

# Collect each distinct non-stop word found in the document.
open( my $doc_fh, "<", $document_name ) or die "$document_name: $!";
while (<$doc_fh>) {
    for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
        $docwords{$_} = undef unless ( exists( $stopwords{$_} ));
    }
}
close $doc_fh;

# Emit one "doc_id <tab> word" row per distinct word.
for (keys %docwords) {
    print "$document_name\t$_\n";
}

If you run that on each document file and concatenate all the outputs into a simple two-column table (called "doc_word_index" here, with columns "doc_id" and "doc_word"), you can then provide a search tool that uses a simple query like:

SELECT DISTINCT doc_id FROM doc_word_index WHERE doc_word = ?

When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses on that query. Other boolean queries ("this_word and that_word", etc) can be set up easily as well.
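Here is one way the "or" and "and" variants might be generated, as a rough sketch only. The helper name build_query is made up for illustration, and the "and" form relies on the assumption that the table holds at most one row per (doc_id, doc_word) pair, or at least that counting distinct words per doc is acceptable:

```perl
use strict;
use warnings;

# Hypothetical helper: build a query for docs matching ANY term ("or")
# or ALL terms ("and"). Returns the SQL string followed by the bind values.
sub build_query {
    my ( $mode, @terms ) = @_;
    die "need at least one search term\n" unless @terms;

    # The index stores lowercased words; also drop duplicate terms.
    my %seen;
    @terms = grep { !$seen{$_}++ } map { lc($_) } @terms;

    if ( $mode eq 'or' ) {
        my $where = join ' OR ', ('doc_word = ?') x @terms;
        return ( "SELECT DISTINCT doc_id FROM doc_word_index WHERE $where", @terms );
    }

    # "and": a doc qualifies only if it matches every one of the terms.
    my $marks = join ', ', ('?') x @terms;
    my $sql = "SELECT doc_id FROM doc_word_index WHERE doc_word IN ($marks) "
            . "GROUP BY doc_id HAVING COUNT(DISTINCT doc_word) = ?";
    return ( $sql, @terms, scalar @terms );
}

my ( $sql, @bind ) = build_query( 'or', 'foo', 'bar' );
print "$sql\n";
```

If you are using DBI, the result could be passed along as something like $dbh->selectcol_arrayref( $sql, undef, @bind ), which keeps the user's search terms in placeholders rather than pasted into the SQL text.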

There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.
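To make those two ideas a bit more concrete, here is a small in-memory sketch. Both function names (naive_stem and rank_docs) are invented for this example, and the "stemmer" is deliberately crude: it just strips a few English suffixes, which is nowhere near what a real stemming algorithm does. The ranking simply counts distinct matched terms per doc, as described above:

```perl
use strict;
use warnings;

# Deliberately crude stemmer (an illustration, not a real algorithm):
# strips a trailing "ing", "ed" or "s" from words longer than 4 letters.
sub naive_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|ed|s)$// if length($word) > 4;
    return $word;
}

# Rank documents by how many distinct search terms each one contains.
# $rows is an array ref of [ doc_id, doc_word ] pairs, i.e. the same
# two columns as the index table.
sub rank_docs {
    my ( $rows, @terms ) = @_;

    my %wanted;
    $wanted{ naive_stem($_) } = 1 for @terms;

    my %hits;    # doc_id => { stem => 1 }
    for my $row (@$rows) {
        my ( $doc, $word ) = @$row;
        my $stem = naive_stem($word);
        $hits{$doc}{$stem} = 1 if $wanted{$stem};
    }

    my %score;
    $score{$_} = scalar keys %{ $hits{$_} } for keys %hits;

    # Best score first; ties are broken by doc_id so the order is stable.
    return sort { $score{$b} <=> $score{$a} or $a cmp $b } keys %score;
}

my @rows = ( [ 'a.txt', 'blooming' ], [ 'b.txt', 'bloomed' ], [ 'b.txt', 'roses' ] );
print join( "\n", rank_docs( \@rows, 'bloom', 'rose' ) ), "\n";
```

In a real setup you would more likely stem the words once, when the index is built, and store the stems in the table, so that the same counting can be done by the database.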

(update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)
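For what it's worth, the table and index setup might look something like the following. The table and column names match the query above, but the column types and the index name idx_doc_word are just guesses on my part, and the exact syntax can vary between database systems:

```perl
use strict;
use warnings;

# Table and column names follow the query above; the column types and the
# index name idx_doc_word are assumptions, so adjust them for your database.
my @schema = (
    "CREATE TABLE doc_word_index (doc_id VARCHAR(255) NOT NULL, doc_word VARCHAR(64) NOT NULL)",
    "CREATE INDEX idx_doc_word ON doc_word_index (doc_word)",
);

# Print the statements so they can be fed to the database's command-line client.
print "$_;\n" for @schema;
```

With DBI you could instead run each statement through $dbh->do($_) rather than printing it.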

In reply to Re: Creating Metadata from Text File by graff
in thread Creating Metadata from Text File by Trihedralguy
