I've been working on latent semantic search engines in Perl, and have a similar module on hand, although it won't be ready for CPAN for a while. Based on my experience, here are some questions you may wish to consider as you design your code:
- Do you want to support exact phrase matching? If so, what constitutes a phrase, and how is it parsed out? Do the elements of a phrase match also count as keywords?
- Are you assuming the search terms will always be in English? If so, you may want to use a stemmer like Lingua::EN::Stemmer to improve recall.
- Do you want to ignore case in the query, or use it for clues about which words are wanted? For example, do you want to recognize proper names and treat them differently based on capitalization? What about acronyms ('AIDS' vs. 'aids')?
- Do you want to treat word order as significant? This could make a difference for collocations ('hot dog' vs. 'dog hot').
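To make a few of these questions concrete, here is a rough sketch of one possible set of answers. It is written in Python only for brevity (the same logic ports to Perl regexes directly), and it is not taken from my module. The function name parse_query, the 'acronym'/'plain' labels, and the rule that an all-caps token of two or more characters counts as an acronym are all assumptions made for illustration. Phrases are taken to be anything inside double quotes, they keep their word order, and their elements are not also counted as keywords.

```python
import re

def parse_query(query):
    """Split a query into quoted phrases and loose keywords (illustrative only)."""
    # Assumption: a phrase is whatever sits between a pair of double quotes.
    phrases = re.findall(r'"([^"]+)"', query)
    # Remove the quoted parts so phrase words are not also counted as keywords.
    remainder = re.sub(r'"[^"]*"', ' ', query)
    words = re.findall(r"[A-Za-z0-9']+", remainder)

    keywords = []
    for word in words:
        # Assumption: all-caps tokens of 2+ characters are acronyms, case preserved.
        if len(word) > 1 and word.isupper():
            keywords.append((word, 'acronym'))
        else:
            keywords.append((word.lower(), 'plain'))

    # Phrases are stored as ordered tuples, so 'hot dog' and 'dog hot' differ.
    phrase_tokens = [tuple(p.lower().split()) for p in phrases]
    return {'phrases': phrase_tokens, 'keywords': keywords}
```

Calling parse_query('"hot dog" AIDS relief') keeps ('hot', 'dog') as an ordered phrase, keeps 'AIDS' in upper case, and lowercases 'relief'. Each of those choices is one you may well want to make differently.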
CPAN module or not, there is quite a bit of existing code in the field, so I would encourage you to look around (as you are doing!) before you do too much coding. A good reference is Foundations of Statistical Natural Language Processing by Manning and Schütze.
I'm also happy to share my own code if you like.
Good luck!