I've been working on latent semantic search engines in Perl, and have a similar module on hand, although it won't be ready for CPAN for a while. Based on my experience, here are some questions you may wish to consider as you design your code:
- Do you want to support exact phrase matching? If so, what constitutes a phrase, and how is it parsed out? Do the elements of a phrase match also count as keywords?
- Are you assuming the search terms will always be in English? If so, you might consider using a stemmer like Lingua::EN::Stemmer to improve recall.
- Do you want to ignore case in the query, or use it for clues about which words are wanted? For example, do you want to recognize proper names and treat them differently based on capitalization? The same question applies to acronyms ('AIDS' vs. 'aids').
- Do you want to treat word order as significant? This can make a difference for collocations ('hot dog' vs. 'dog hot').
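To make a few of these questions concrete, here is a minimal sketch of one possible set of answers, not a recommendation. It assumes a phrase is anything inside double quotes, that the words of a phrase also count as keywords, that word order is preserved inside a phrase, and that all-caps tokens are left alone so 'AIDS' stays distinct from 'aids'. The names parse_query and normalize are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query into quoted phrases and loose keywords.
# Assumptions (for illustration only): a phrase is the text between double
# quotes, and the words of a phrase are also counted as keywords.
sub parse_query {
    my ($query) = @_;
    my (@phrases, @keywords);

    # Pull out quoted phrases first, keeping word order within each one.
    while ($query =~ /"([^"]+)"/g) {
        push @phrases, [ map { normalize($_) } split ' ', $1 ];
    }

    # Drop the quote marks; every whitespace-separated token is a keyword.
    (my $flat = $query) =~ tr/"/ /;
    push @keywords, map { normalize($_) } split ' ', $flat;

    return { phrases => \@phrases, keywords => \@keywords };
}

# Keep all-caps tokens of two or more letters (probably acronyms) as typed;
# lowercase everything else.
sub normalize {
    my ($token) = @_;
    return $token =~ /^[A-Z]{2,}$/ ? $token : lc $token;
}

my $parsed = parse_query('"Hot dog" vendors AIDS research');
print join(' ', @{ $parsed->{phrases}[0] }), "\n";   # hot dog
print join(' ', @{ $parsed->{keywords} }), "\n";     # hot dog vendors AIDS research
```

Because each phrase is stored as an ordered list, '"hot dog"' and '"dog hot"' stay distinguishable, while the flat keyword list is available for order-insensitive matching. Stemming could be applied inside normalize if you decide the queries are always English.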
CPAN module or not, there is quite a bit of existing code in the field, so I would encourage you to look around (as you are doing!) before you do too much coding. A good reference is Foundations of Statistical Natural Language Processing by Manning and Schütze.
I'm happy to share my own code if you like, as well.
Good luck!