Option one looks up every term and every synonym individually. Option two converts all synonyms to their respective root terms before gathering word frequencies. In short, would it be more efficient to search 100,000 hash keys as in option one, or to use option two and do global substitutions on the documents so that the hash shrinks to only 10,000 keys?
I have a large set of documents that I want to categorize, explore, and search by term. Terms include root terms and their synonyms. For example, the root term "meat" could have the synonyms "ham", "beef", and "chicken". This could be stored in a hash in which every synonym leads to its root term:
$termHash{ "ham" } = "meat";
$termHash{ "beef" } = "meat";
In this way, different terms that mean or relate to the same thing will all be counted as occurrences of the base term. A search for "ham" would then also return documents that contain "beef" or "meat".
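To make the two options concrete, here is a minimal sketch of both on a single toy document. It assumes root terms are also stored in the hash, mapped to themselves, and that a simple `\w+` tokenizer is good enough; the sub names `count_by_lookup` and `count_by_substitution` are made up for illustration.

```perl
use strict;
use warnings;

# Synonym => root term. Roots map to themselves (an assumption of this
# sketch) so a single lookup handles both roots and synonyms.
my %termHash = (
    meat    => 'meat',
    ham     => 'meat',
    beef    => 'meat',
    chicken => 'meat',
);

# Option one: tokenize, then look each word up in the full hash.
sub count_by_lookup {
    my ($text) = @_;
    my %count;
    for my $word ( lc($text) =~ /(\w+)/g ) {
        my $root = $termHash{$word};
        $count{$root}++ if defined $root;
    }
    return \%count;
}

# Option two: rewrite synonyms to their roots first, then count roots only.
# Longer keys come first so a longer term wins over a shorter prefix.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %termHash;
my $synonym_re = qr/\b($alternation)\b/i;
my %is_root    = map { $_ => 1 } values %termHash;

sub count_by_substitution {
    my ($text) = @_;
    ( my $normalized = $text ) =~ s/$synonym_re/$termHash{ lc $1 }/ge;
    my %count;
    for my $word ( lc($normalized) =~ /(\w+)/g ) {
        $count{$word}++ if $is_root{$word};
    }
    return \%count;
}

my $doc = 'Ham and beef were served; the chicken was gone. Meat again tomorrow.';
my $by_lookup = count_by_lookup($doc);
my $by_subst  = count_by_substitution($doc);
printf "lookup: meat=%d, substitution: meat=%d\n",
    $by_lookup->{meat}, $by_subst->{meat};
```

Both subs give the same count here. In this sketch option two still scans every document and adds a rewrite pass before counting, so timing both on a sample of the real documents would probably be the way to settle which is faster.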
The numbers are a bit overwhelming:
- 26 million documents
- 10,000 base terms
- 100,000 synonyms
This will run on a cluster where each node works independently on its own set of documents. It looks like the term hash will be built separately and stored as either a persistent hash or an XML file.
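For the persistent-hash route, one possibility is the core Storable module: build the hash once, write it to disk, and have each node load it at startup. This is only a sketch; the file name `termhash.sto` and the temporary directory are placeholders, and a tied DBM file would be another way to do it.

```perl
use strict;
use warnings;
use Storable   qw(nstore retrieve);
use File::Temp qw(tempdir);
use File::Spec;

my %termHash = ( ham => 'meat', beef => 'meat', chicken => 'meat' );

# Placeholder location; on the cluster this would be a path every node can read.
my $dir  = tempdir( CLEANUP => 1 );
my $file = File::Spec->catfile( $dir, 'termhash.sto' );

# nstore writes in network byte order, so the file is portable between machines.
nstore( \%termHash, $file );

# Each node loads the whole hash into memory once, before processing documents.
my $loaded = retrieve($file);
print "beef => $loaded->{beef}\n";
```

`retrieve` returns a hash reference, so lookups on the nodes become `$loaded->{$word}` rather than `$termHash{$word}`.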