in reply to Creating Metadata from Text File
Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
        or die "Usage: $0 stopword.list document.file\n";

    my ( %stopwords, %docwords );
    my ( $stopword_name, $document_name ) = @ARGV;

    # Load the stop words: lower-cased, purely alphabetic tokens only
    open( my $stop_fh, "<", $stopword_name ) or die "$stopword_name: $!";
    while (<$stop_fh>) {
        my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
        $stopwords{$_} = undef for ( @words );
    }
    close $stop_fh;

    # Collect each distinct document word that is not a stop word
    open( my $doc_fh, "<", $document_name ) or die "$document_name: $!";
    while (<$doc_fh>) {
        for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
            $docwords{$_} = undef unless ( exists( $stopwords{$_} ));
        }
    }
    close $doc_fh;

    # Print one "document<TAB>word" row per distinct word
    for (keys %docwords) {
        print "$document_name\t$_\n";
    }

If you run that on each document file, and concatenate all the outputs together into a simple two-column table, you can then provide a search tool that uses a simple query like:

    SELECT distinct(doc_id) from doc_word_index where doc_word = ?

When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses on that query. Other boolean queries ("this_word and that_word", etc) can be set up easily as well.
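Here is a rough sketch of how those "or" clauses (and one possible "and" variant) could be generated from a list of search words. The helper name build_word_query is made up for illustration, and the "and" form using group by / having is just one way to do it, not something from the original script. It only assumes the doc_word_index table with its doc_id and doc_word columns.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions): build the SQL
# text and the bind values for a search on one or more words.
# $mode is "or" (any word may match) or "and" (every word must match).
sub build_word_query {
    my ( $mode, @words ) = @_;
    die "need at least one search word\n" unless @words;

    # The index holds lower-cased words, so normalize the search terms the
    # same way and drop duplicates.
    my %seen;
    my @terms = grep { !$seen{$_}++ } map { lc($_) } @words;

    # One "doc_word = ?" clause per term, joined with " or "
    my $where = join " or ", ("doc_word = ?") x @terms;

    my $sql = "SELECT distinct(doc_id) from doc_word_index where $where";
    if ( $mode eq "and" ) {
        # Keep only the docs that matched every distinct term
        $sql = "SELECT doc_id from doc_word_index where $where"
             . " group by doc_id having count(distinct doc_word) = "
             . scalar(@terms);
    }
    return ( $sql, @terms );
}

# With DBI and a connected handle $dbh (assumed), the result plugs in as:
#   my ( $sql, @bind ) = build_word_query( "or", "foo", "bar" );
#   my $doc_ids = $dbh->selectcol_arrayref( $sql, undef, @bind );
my ( $sql, @bind ) = build_word_query( "or", "foo", "bar" );
print "$sql\n";
```

Keeping the words as bind values rather than pasting them into the SQL string means user input never ends up inside the query text.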
There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.
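To make the "relevance" idea concrete, here is a small sketch that ranks docs by the number of distinct search terms each one contains. It works on rows already pulled from the two-column table; the helper name rank_docs and the in-memory row layout are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: rank documents by how many distinct search terms
# each one contains. $rows is an array ref of [ doc_id, doc_word ] pairs
# (the two-column table); @terms are the user's search words.
sub rank_docs {
    my ( $rows, @terms ) = @_;
    my %wanted = map { ( lc($_) => 1 ) } @terms;

    my %hits;    # doc_id => { matched term => 1 }
    for my $row (@$rows) {
        my ( $doc_id, $word ) = @$row;
        $hits{$doc_id}{$word} = 1 if $wanted{$word};
    }

    # Most distinct matches first; ties are broken by doc_id
    return sort {
        scalar( keys %{ $hits{$b} } ) <=> scalar( keys %{ $hits{$a} } )
            or $a cmp $b
    } keys %hits;
}

my @rows = (
    [ "a.txt", "foo" ], [ "a.txt", "bar" ],
    [ "b.txt", "foo" ], [ "c.txt", "baz" ],
);
print join( "\n", rank_docs( \@rows, "foo", "bar" ) ), "\n";
```

The same ordering could probably be pushed into the database instead, by grouping on doc_id and sorting on count(distinct doc_word) in descending order.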
(update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)
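For what it's worth, the table and index setup might look something like the sketch below. The column types, the index name doc_word_idx, and the helper name are all assumptions; the exact DDL will depend on which database you use.

```perl
use strict;
use warnings;

# Hypothetical helper: DDL for the two-column table and its index.
# Column types and the index name are assumptions; adjust for your database.
sub index_schema_statements {
    return (
        "CREATE TABLE doc_word_index ("
            . "doc_id VARCHAR(255) NOT NULL, "
            . "doc_word VARCHAR(64) NOT NULL)",
        "CREATE INDEX doc_word_idx ON doc_word_index (doc_word)",
    );
}

# With DBI and a connected handle $dbh (assumed), run each statement once:
#   $dbh->do($_) for index_schema_statements();
print "$_;\n" for index_schema_statements();
```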