Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm reviewing some search-engine-like code which scores documents according to two different criteria.

One is "number of times word appears", and the other is "contains n of the words searched for".

It's a crude but effective way of scoring documents.

If the search term is "apple banana cherry", a document may score 10 for its apple-content, 5 for its banana-content and 1 for its cherry-content, for a total of 16, but then get a "bonus" of fifty points for containing all three words -- putting it ahead of another document that scores 20 for cherry and 20 for banana but zero for apple.

When I wrote the code a year ago I never would have thought of using multidimensional structures, but having hung around the Monks for a year, it occurs to me that I can do it this way:

    $total_word_score{$this_document}  = 16;
    $number_of_matches{$this_document} = 3;
keeping two hashes, or I can do it this way:
    $various_scores{$this_document} = [16,3];

Are there any arguments -- stylistic, maintenance, extensibility -- for doing it one way rather than the other?
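For concreteness, the two layouts side by side, with the example values from above (document name is made up):

```perl
use strict;
use warnings;

my $this_document = 'doc1.html';

# Layout 1: two parallel hashes, one per score type
my ( %total_word_score, %number_of_matches );
$total_word_score{$this_document}  = 16;
$number_of_matches{$this_document} = 3;

# Layout 2: one hash of two-element array refs
my %various_scores;
$various_scores{$this_document} = [ 16, 3 ];

# Reading the values back is a one-liner either way
print "parallel: $total_word_score{$this_document}/$number_of_matches{$this_document}\n";
print "combined: $various_scores{$this_document}[0]/$various_scores{$this_document}[1]\n";
```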
--

“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
M-J D

Replies are listed 'Best First'.
Re: Hash of Arrays versus Two Hashes
by djantzen (Priest) on Feb 12, 2003 at 03:31 UTC

    Keep them separate only if you need to do complex things with one in distinction from the other. It sounds like you generally need both pieces of data, so it makes sense to keep them together, in light of the extra resources required for another entire hash. I suggest accessing the array elements through named constants, though, so you'd have:

    use constant SCORE       => 0;
    use constant NUM_MATCHES => 1;

    my $num = $various_scores{$this_document}[NUM_MATCHES];

    Adding further entries, which should occur fairly infrequently, is merely a matter of adding new constants.


    "The dead do not recognize context" -- Kai, Lexx
Re: Hash of Arrays versus Two Hashes
by graff (Chancellor) on Feb 12, 2003 at 03:55 UTC
    I would make a couple of assumptions about your app (I am prone to making assumptions, as my wife will attest...):
    • there would be a "final score" to be computed through some (possibly simple) combination of "feature scores", such as summing, weighted average, etc.
    • there may be cause to add more "feature scores" to supplement the two that you described

    I think both of these tend to disfavor an approach with separate/parallel hashes, because coding the existing method, and extending it later, takes more and more tedious code. It's much easier to combine numbers that are already in one array, and much easier to just push more numbers onto the array as needed.

    If you're worried about the potential obscurity of using just an array index (0, 1, ...) to mean "single-term frequency score", "combined-term score", ... well, there are easy ways to keep it clear and explicit (#COMMENT IT!) -- at worst, you could use HoH to store the numeric values of each type for each doc, but that seems unnecessary in this sort of case.
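The "combine numbers already in one array" idea can be sketched as a weighted sum over the feature slots; the weights, documents and slot meanings below are made up for illustration:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Feature slots, commented as suggested:
#   [0] total per-word frequency score
#   [1] count of distinct search terms matched
my %various_scores = (
    'doc_a' => [ 16, 3 ],
    'doc_b' => [ 40, 2 ],
);

# Illustrative weights, one per feature slot; adding a feature later
# just means pushing a number onto each array and adding a weight here.
my @weights = ( 1, 50 );

sub final_score {
    my ($features) = @_;
    return sum map { $features->[$_] * $weights[$_] } 0 .. $#weights;
}

for my $doc ( sort keys %various_scores ) {
    printf "%s => %d\n", $doc, final_score( $various_scores{$doc} );
}
```

With these made-up weights, doc_a's three-term match outweighs doc_b's higher raw frequency, mirroring the "bonus" idea in the original question.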

    BTW, another feature commonly used in searches is the relative "distinctiveness" of the search terms: given some basis for knowing the a priori likelihood of each term in general usage, assign more or less weight to its occurrences in a document. For example, if the search terms are "ionized charged", documents that contain only the first should probably score higher than those that contain only the second; the relative weights of the two terms are inversely proportional to their respective frequencies in some collection of typical text data.
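That inverse-frequency weighting is essentially what's known as inverse document frequency (IDF); a minimal sketch, with made-up collection counts:

```perl
use strict;
use warnings;

# Illustrative document-frequency counts for each term in some
# background collection of 1000 documents (numbers are made up).
my $collection_size = 1000;
my %doc_freq = ( ionized => 5, charged => 400 );

# Rarer terms get a larger weight: log(N / df), a common IDF formula.
sub idf {
    my ($term) = @_;
    return log( $collection_size / $doc_freq{$term} );
}

printf "%-8s weight %.2f\n", $_, idf($_) for qw(ionized charged);
# "ionized" outweighs "charged" because it is much rarer.
```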

Re: Hash of Arrays versus Two Hashes
by tall_man (Parson) on Feb 12, 2003 at 04:56 UTC
    Here is a measurement script for the two approaches for fairly large hashes (credits to the Anonymous Monk who submitted it in Re: Determining the Memory Usage of a Perl program from within Perl):
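The script itself isn't reproduced above; a minimal Linux-only stand-in, reading the process's resident set size from /proc (one of the tricks discussed in the thread tall_man credits), might look like this -- exact numbers will vary by perl build:

```perl
use strict;
use warnings;

# Linux-only: report this process's resident set size in kB.
sub mem_kb {
    open my $fh, '<', '/proc/self/status' or die "no /proc: $!";
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
    return;
}

my $before = mem_kb();

# Build one of the two layouts for 100_000 fake documents;
# swap in two parallel hashes here to compare the other layout.
my %various_scores;
$various_scores{"doc$_"} = [ 16, 3 ] for 1 .. 100_000;

my $after = mem_kb();
printf "grew by about %dk\n", $after - $before;
```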

    I measured 4352k for the two-hash case and 4944k for the single hash of arrays. It makes sense to me that the hash of arrays costs more because of all those extra references.

    What data structure makes sense for a search engine? I'm not sure either of these do, if you are scoring hundreds of thousands of pages. You don't want to put all those scores and page references in memory and then sort them.

    What might make sense is a heap, in which you can keep the top N best scores partially sorted with a logarithmic insertion time. There's a Heap module for a starting point.
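The Heap module's interface isn't shown here, but the top-N idea can be sketched without it -- stream over all documents and keep only the N best, instead of sorting everything at the end (a real implementation would use a heap for O(log N) insertion; this sketch just re-sorts the small list each time):

```perl
use strict;
use warnings;

my $N = 3;
my @top;    # each element: [ score, doc_name ]

sub consider {
    my ( $score, $doc ) = @_;
    push @top, [ $score, $doc ];
    @top = sort { $b->[0] <=> $a->[0] } @top;
    splice @top, $N if @top > $N;    # drop everything below the top N
}

# Made-up document scores to exercise the sketch
my %scores = ( a => 66, b => 40, c => 12, d => 90, e => 5 );
consider( $scores{$_}, $_ ) for keys %scores;

print join( ', ', map {"$_->[1]=$_->[0]"} @top ), "\n";
```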

Re: Hash of Arrays versus Two Hashes
by perrin (Chancellor) on Feb 12, 2003 at 03:19 UTC
    I'm having a hard time thinking of a situation where the first way would be better. It uses more memory, and is slower if you actually want both scores. I say number two all the way.
Re: Hash of Arrays versus Two Hashes
by BrowserUk (Patriarch) on Feb 12, 2003 at 12:56 UTC

    Using parallel hashes has the advantages of

    • taking less memory

      2 hashes -- 100_000 docs/10 char names/2 numeric values ~= 14MB

      Hash of 2-element arrays -- 100_000/10-char names/2 numeric values ~= 23MB.

    • The named hashes, indexed by doc name, are somewhat clearer than a single docs hash indexed by doc name and sub-indexed by mysterious numbers -- though that can easily be alleviated by using

      use constant TOTAL_WORD_SCORE  => 0;
      use constant NUMBER_OF_MATCHES => 1;

    You pay the price for all those 2-element arrays. However, if the kinds of information you are going to store about each document are likely to grow, then there are rapidly diminishing returns from using parallel hashes.

    By the time you get to 4 hashes -v- 1 hash of 4-element arrays, the penalty of duplicating the hash keys comes to bear with the former taking ~= 25MB and the latter ~= 30MB.

    So it's easy to see that, from the extensibility point of view, the HoA comes out the winner pretty quickly; you should probably only consider parallel hashes if you're pretty certain that you won't need to extend and/or you're really up against it memory-wise.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Hash of Arrays versus Two Hashes
by dragonchild (Archbishop) on Feb 12, 2003 at 15:42 UTC
    Is it just me or does this problem just scream OO? You have a bunch of data and a bunch of non-trivial work to do on that data. Why not create an object for each document's scoring? That way, you have a bunch of objects. You then have a really neat way of determining the list to display.
    my $search_string = $cgi->param('search_string');
    my @display_these =
        map  { $_->[1] }                           # Strip off the ST
        sort { $b->[0] <=> $a->[0] }               # Sort best score first
        grep { $_->[0] > 0 }                       # Remove anything that scores a 0
        map  { [ $_->score($search_string), $_ ] } # Create the modified ST
        @document_objects;
    The object would have a list of its words and how it scores. The object will probably be a hash (if you're lazy) or an array (if you're worried about space/speed). You'll have a hash of words that will key into hashrefs (cause they take less space when small) for all the ways something can score for that word. I dunno. :-)
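A minimal version of the object described above, as a plain hash-based class; the scoring scheme is a guess at the OP's (sum of per-word counts plus a flat all-terms bonus), and the 50-point bonus and sample data are illustrative:

```perl
package Document;
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    # $args{words} is a hashref of word => count for this document
    return bless { name => $args{name}, words => $args{words} }, $class;
}

# Sum per-word frequencies, plus a flat bonus when every search
# term appears at least once (bonus value is illustrative).
sub score {
    my ( $self, $search_string ) = @_;
    my @terms = split ' ', $search_string;
    my ( $score, $hits ) = ( 0, 0 );
    for my $term (@terms) {
        my $count = $self->{words}{$term} || 0;
        $score += $count;
        $hits++ if $count;
    }
    $score += 50 if @terms && $hits == @terms;
    return $score;
}

package main;
my $doc = Document->new(
    name  => 'doc1',
    words => { apple => 10, banana => 5, cherry => 1 },
);
print $doc->score('apple banana cherry'), "\n";    # 16 + 50 bonus = 66
```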

    Plus, you left out the part where this is really a 3-level structure, not a 2-level structure. You probably don't think of it that way, but you are de-referencing one level before you even get to the code posted above.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.