<doc id="some_id_string" title="title string" date="date_string" ...>
<text>
This is the text of the article. Blah blah...
</text>
</doc>
...
Now, suppose you define a new table for indexing the corpus content, with these columns:
create table content_index (
    search_term  varchar(50),
    doc_id       varchar(30),
    in_title     char(1),
    in_body      char(1),
    how_many     integer
)
A Perl script can read through the corpus and write a flat text file that you can load into this table. The script is easy (the one below is lightly tested, and I presume some monks will chastise me for not using an XML module... then again, you could just as well use some other method to format the text, or adapt this script to get what it needs directly from the database rather than reading from a text file):
#!/usr/bin/perl
use strict;
use warnings;

local $/ = "</doc>\n";

while (<>) {    # reading from the corpus text file...
    my %tknhist = ();

    # get the doc id and title:
    my ( $id, $title ) = (/id="(.*?)" title="(.*?)"/);

    # isolate the text
    my ($text) = ( m{<text>\s+(.*?)</text>}s );

    # downcase, remove punctuation, tokenize, count
    my $in = "ttl";
    for ( $title, $text ) {
        tr/A-Z'".,;:!?#&%$[]()0-9/a-z/d;   # only A-Z are replaced; everything from ' onward is deleted
        for my $tkn ( grep /\w{3,}/, split /\s+/ ) {
            $tknhist{$in}{$tkn}++;         # only count words of 3 or more letters
        }
        $in = "bdy";
    }

    for my $tkn ( keys %{ $tknhist{bdy} } ) {
        my $in = "N,Y";                    # "not_in_title,in_body"
        if ( exists $tknhist{ttl}{$tkn} ) {
            $in =~ s/N/Y/;                 # this token is in both places
            $tknhist{bdy}{$tkn} += $tknhist{ttl}{$tkn};
            delete $tknhist{ttl}{$tkn};
        }
        print join( ",", $tkn, $id, $in, $tknhist{bdy}{$tkn} ), "\n";
    }

    for my $tkn ( keys %{ $tknhist{ttl} } ) {   # tokens in title only (if any)
        print join( ",", $tkn, $id, "Y,N", $tknhist{ttl}{$tkn} ), "\n";
    }
}
So that produces a listing where each line contains a "word,doc_id" pair, together with flags for whether the word appeared in the title, the text body, or both, and how many times in total that word occurred in that doc.
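For instance, for a hypothetical doc with id "doc_0117" and the title "Kernel release notes" (both made up purely for illustration), a few lines of that output might look like:
kernel,doc_0117,Y,Y,5
release,doc_0117,Y,Y,3
patch,doc_0117,N,Y,2
notes,doc_0117,Y,N,1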
Now, you need to filter this flat-table text output a bit before you load it into your new index table. If a given "search term" shows up in all the docs -- or in a large fraction of them -- it's not much use as a search term, so it doesn't make sense to keep it in the indexing table.
Any number of means can be used to determine the "document frequency" of each search term -- that is, how many docs contain the given term -- e.g. this unix shell command line:
cut -f1 -d, table-data | sort | uniq -c | sort -nr > word.doc-freqs
produces a listing with "#docs word" on each line, most common words first. At the top of the list will be some number of common English words that show up in every doc. Skip those, and keep skipping down the list until the terms start to look "informative" or "distinctive" and the document frequency falls below some sensible percentage of the corpus (50% of the docs? more? less? I'm not sure...). You may also decide to eliminate or fix obvious misspellings (or not).
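Once you've settled on a cut-off, the filtering itself can be a few lines of Perl. This is just a sketch: the file names are the ones used above, the total doc count is passed in as an argument, and the 50%-of-corpus threshold is only a placeholder for whatever number you decide on.
#!/usr/bin/perl
# usage: filter_terms.pl ndocs word.doc-freqs table-data > filtered-data
use strict;
use warnings;

my $ndocs  = shift @ARGV;          # total number of docs in the corpus
my $cutoff = 0.5 * $ndocs;         # placeholder: drop terms found in >50% of docs
my $freqs  = shift @ARGV;          # output of the "uniq -c" pipeline above

my %too_common;
open my $fh, '<', $freqs or die "$freqs: $!";
while (<$fh>) {
    my ( $df, $word ) = split;     # each line is "#docs word"
    $too_common{$word} = 1 if $df > $cutoff;
}
close $fh;

# pass through only the rows whose search term survived the cut-off;
# the surviving lines are what you load into content_index
while (<>) {
    my ($term) = split /,/;
    print unless $too_common{$term};
}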
Once you establish the cut-off point, filter your content-index data to remove all the words that are above the cut-off, and load the remainder into your new table. Now you can ask a user for search terms, and start by querying for those terms in this index table -- if a term isn't there, it's either too frequent, or non-existent (useless in either case). If all the terms provided by the user end up as no-shows, you'll have to ask for different terms.
When the terms show up and yield a large number of rows (i.e. many docs containing the terms), you can alert the user about the size of the result and handle paging through the set as needed, because the query results tell you exactly how many docs matched. The query on this table can include an "order by how_many desc", which sorts the returned doc_ids so that the ones with the most occurrences of the search term come first. And you can limit the search to "title contains word_x" or "body contains word_x" by including tests for "in_title='Y'" (or 'N') and "in_body='Y'" (or 'N').
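As a rough sketch of that query from Perl (the DBI connection string, user, and password here are placeholders for whatever your own database needs, and the term is assumed to be already normalized as described next):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# placeholder connection values -- substitute your own DSN, user, password
my $dbh = DBI->connect( "dbi:Pg:dbname=corpus", "user", "password",
                        { RaiseError => 1 } );

my $term = shift @ARGV;    # one search term, already cleaned up (see below)

my $rows = $dbh->selectall_arrayref(
    "select doc_id, how_many
       from content_index
      where search_term = ?
        and in_title = 'Y'            -- optional: require the term in the title
      order by how_many desc",
    undef, $term
);

printf "%d matching docs\n", scalar @$rows;
print "$_->[0] ($_->[1] occurrences)\n" for @$rows;

$dbh->disconnect;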
Just be sure that when you get search terms from the user, you treat them the same way as you did the tokens in the corpus -- convert to lower case, eliminate the same punctuation, and ignore terms with fewer than 3 letters -- before you pass them into the query on the content_index table.
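In other words, something like this (the same tr as in the indexing script; clean_term is just a name I made up) applied to each user-supplied term before it gets bound into the query:
# downcase, strip the same punctuation and digits as the indexer did,
# and reject anything without at least 3 word characters
sub clean_term {
    my ($term) = @_;
    $term =~ tr/A-Z'".,;:!?#&%$[]()0-9/a-z/d;
    return ( $term =~ /\w{3,}/ ) ? $term : undef;
}

my $clean = clean_term("Don't!");    # yields "dont"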
update: after looking at the OP again, I wanted to add that if you're clever about making up the doc_id strings -- fold the dates into them in a way that makes lexical sort the same as chronological sort -- you could handle date constraints as well as search terms when you query the content_index table ("doc_id > oldest_date_wanted", "doc_id < newest_date_wanted"). This might be considered "cheating" a bit, but I think it's justified in a "simple, crude" (quick-and-dirty) setting.
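For example (the id format here is just one hypothetical choice), a doc_id built from a YYYYMMDD date plus a per-day sequence number sorts lexically in date order, so a date window becomes two extra comparisons in the where clause:
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical doc_id scheme: date in YYYYMMDD form plus a sequence number,
# so lexical order is the same as chronological order
my ( $year, $month, $day, $seq ) = ( 2004, 1, 17, 3 );
my $doc_id = sprintf "%04d%02d%02d_%04d", $year, $month, $day, $seq;
print "$doc_id\n";    # prints "20040117_0003"

# a date constraint in the content_index query then looks like:
#   and doc_id >= '20040101_0000'
#   and doc_id <= '20041231_9999'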