in reply to Hash of Arrays versus Two Hashes
I think these both tend to disfavor an approach where you have separate/parallel hashes, because coding the existing method, and adding extensions later, takes more code, and more tedious code at that. It's much easier to combine numbers that happen to be in one array already, and much easier to just push more numbers onto the array as needed.
If you're worried about the potential obscurity of using just an array index (0, 1, ...) to mean "single-term frequency score", "combined-term score", ... well, there are easy ways to keep it clear and explicit (#COMMENT IT!) -- at worst, you could use a HoH (hash of hashes) to store the numeric values of each type for each doc, but that seems unnecessary in this sort of case.
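Here's a rough sketch of what I mean, with the array slots given names via constants. The doc names, the score values and the constant names are all invented for illustration; adapt to whatever your real data looks like.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Named indices, so nobody has to remember what slot 0 or 1 means.
# (These names are hypothetical, not from the original code.)
use constant {
    SINGLE_TERM => 0,    # single-term frequency score
    COMBINED    => 1,    # combined-term score
};

my %score;               # doc name => [ scores ... ]
$score{'doc1.txt'}[SINGLE_TERM] = 3;
$score{'doc1.txt'}[COMBINED]    = 2;
$score{'doc2.txt'}[SINGLE_TERM] = 5;
$score{'doc2.txt'}[COMBINED]    = 0;

# Adding another kind of score later is just one more push per doc.
push @{ $score{'doc1.txt'} }, 1;
push @{ $score{'doc2.txt'} }, 4;

# Combining the numbers for a doc is a single pass over its array.
my %total;
$total{$_} = sum( @{ $score{$_} } ) for keys %score;

print "$_: $total{$_}\n" for sort keys %total;
```

With two parallel hashes you'd have to touch every hash (and every loop over them) each time a new score type shows up; here the summing code doesn't change at all.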
BTW, another feature that is commonly used in searches is the relative "distinctiveness" of search terms: given some basis for knowing the a priori likelihood of each term in general usage, assign more or less weight to its occurrence in a document. For example, if the search terms are "ionized charged", documents that contain only the first should probably score higher than those that contain only the second, because "ionized" is the rarer word. The relative weights of the two terms would be inversely proportional to their respective frequencies in some collection of typical text data.
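A minimal sketch of that weighting, assuming you already have relative frequencies for each term from some reference collection. The frequency numbers, doc names and counts below are made up purely to show the arithmetic.

```perl
use strict;
use warnings;

# Hypothetical relative frequencies of each term in "typical" text;
# real values would come from counts over your own reference collection.
my %general_freq = (
    ionized => 0.00001,
    charged => 0.0005,
);

# Weight each term by the inverse of its general frequency,
# so the rarer term counts for more.
my %weight;
$weight{$_} = 1 / $general_freq{$_} for keys %general_freq;

# Per-document counts of each search term (also invented).
my %count = (
    'docA.txt' => { ionized => 2 },
    'docB.txt' => { charged => 2 },
);

my %score;
for my $doc ( keys %count ) {
    my $terms = $count{$doc};
    $score{$doc} += $terms->{$_} * $weight{$_} for keys %$terms;
}

# Highest-scoring docs first.
printf "%s: %.0f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} } keys %score;
```

Both docs contain one search term twice, but docA.txt comes out ahead because "ionized" is the less common word in the assumed frequency table.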