in reply to A (memory) poor man's <strike>hash</strike> lookup table.


Over the weekend I did a little work on this. I loaded your million numbers into a digit trie (a Patricia Trie restricted to the digits 0-9) optimized for reduced memory consumption. The records loaded in around 100 seconds on a 2.2 gighz machine (if temporary space is not an issue this time drops dramatically), and ultimately required less than 5MB to store. I calculated the ratio of bytes used to store the data versus the sum of lengths of the keys and it was 0.62, meaning that the data structure requires 0.62 bytes to store one byte of key. Lookup time in the structure is comparable (or better) than a raw hash (I will admit the implementation of the lookup is done in Inline::C/XS).

Note that this Trie implementation may not be optimal. Because it stores the keys as digits strings instead of byte encoding the integer and then doing a byte-trie its probably less efficient a trie implemented as the latter. You could probably calculate the storage requirements but I have to admit I cant be bothered, however I will hazard a guesstimate that it would be around 2.5MB. Also this structure would be even faster to do lookups on (the issue of packing the integers as necessary before lookup aside).

Also, these calculations are _purely_ about storing the keys. Not about storing the key along with a value. Assuming this was required and the values to be stored were long ints you could add 4MB to both numbers (storing them as packed longs). This would mean the digit trie required total 8.5 mb and the byte trie would come in at 6.5MB. (Actually im not bothering to calculate the size of an SV here. Presumably you would need a small amount of ram, say (a very generous) 1k for the object that holds the two char arrays.)

Which actually brings up a different point. Instead of storing your million numbers in an array you could store them in a packed string. This would mean the total memory requirement would be 4mb, and would still facilitate binsearching. Finger points into the data may speed that search up drammatically.

The discussion with Judy arrays led me to rereview the subject. Judy arrays are going to be slightly less efficient than a pure byte-tree in terms of storage. I think it would be safe to say that for data packed this densely the minimum bound for memory utilization (without bringing compression into the picture) is going to be the numbers you get from a Trie. However its worth noting that this scenario is realtively unrealistic. With lower density data sets my understanding from what ive read on Judy arrays would suggest they would be much more efficient.

I wanted to say that I found this thread thought provoking and interesting and I hope you post more along its line. The subject of optimisation techniques is an interesting one that is IMO useful and worth discussing. All the old caveats about optimizing are correct, but the fact comes that once you identified an optimization target having discussion and documentation of ways to handle it can be very useful.

PS: Its worth noting that conceptually a Trie is essentially a non cpaturing DFA regex engine anchored at the beginning of the string. Its not difficult to extend a trie to handle Kleene closure (*) and one-or-more (+), and only slighlty more difficult to make it handle non anchored (^) searches as well. The advantage of this is that its more efficient for matching many constant strings than the NFA regex engine that perl uses. (Although of course with serious deficiencies.) Ive long wished that perl handled this internally, allowing for a modifier to cause a regex pattern to be treated as a DFA an not as an NFA, which would essentially mean putting a trie engine into perl.

Its weird that these things are related (who would guess there is a direct family tree relationship between regular expressions and judy arrays!) In fact you could take this point further and argue that really there is a relationship between linked lists (degenrate binary trees) and regular expressions! Woo-hoo! Talk about 6 degrees of seperation. ;-)



    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi

• Update:  
Bart suggested breaking the paragraphs up, and pointed out some typos. Thanks Bart.


Replies are listed 'Best First'.
Re: Re: A (memory) poor man's <strike>hash</strike> lookup table.
by BrowserUk (Patriarch) on Dec 03, 2003 at 00:21 UTC

    Hi demerphq.

    Sorry for taking so long to get back to you but I had to find out what Tries and Patricia Tries were before I could understand your post and appreciate the effort that you had expended in researching it. Thanks for taking the idea seriously.

    It turns out that I had actually implemented something similar a few years ago for a Random Testcase generator -- though I didn't know they were called Tries. The API set being tested consisted of around 1000 functions, grouped into several sub-apis with each function having a 3-character prefix (Prf..., Win..., Gpi, Dev... etc. *) designating its group. The verbs and adjectives that made up the rest of the names formed a fairly small set, so by substituting a numeric value for each of these and the prefixes, and using these numeric values to build a tree, the storage requirements and the search/traversal times were hugely reduced. This was back in the days when 8 MB machines were big, 16 MB machines almost unheard of and 20 Mhz CPU's were top of the range. We needed to conserve memory, and use smart algorithms rather than brute force, in order that things would run in a reasonable time.

    *They may be familiar to anyone who programmed OS/2.

    I'll make no claims to be fully conversant with Tries (of any flavour) until I've coded one myself from scratch at least 3 or 4 times, but I think that I've understood enough to draw the following conclusions.

    In order for Tries to be space efficient, it requires that the domain of the keys be fairly fully utilised. My example of asciized integers lends itself to this rather nicely resulting in a uniform distribution of the keys across the entire domain. However, if the characters making up the keys used the full 7 or 8-bit range of ascii characters, then one million keys would likely result in a sparsely populated tree, and the amount of unused space in the tree would probably outweight the used space by a fairly large factor. If the keys are Unicode, the problem gets much worse very quickly.

    For example. If the keys consist of the 96 non-control characters in the 7-bit ascii set, and if we had 1 million, 4 character keys, the domain space would potentially be 84,934,656 giving a utilisation of 1/84th of the total domain space. And that's the problem with any data structure that utilises fixed size tables.

    I realise that most Trie implementations, in common with Judy arrays, utilise various, more or less complicated, compression mechanisms to avoid such wastage, but these impose there own limitations and burdens. Judy arrays use a very complicated 6-level compression mechanism. The bit twiddling, branching and special casing required in these schemes means that you couldn't efficiently implement them at the Perl level. If I've understood them correctly, writing an implementation of Tries such that it was either generalised enough to handle this, or could be configured at runtime by passing a parameter would be extremely difficult.

    The nice thing about the mechanism I outlined is that it is quite easily tailored to different domains, by choosing how many (and which) subset of the keys are used for the initial indexing. In the version I currently have coded, this is an unpack format that is used to break the supplied key into 2 or 3 parts. Using cees insight at Re: A (memory) poor man's <strike>hash</strike> lookup table. of removing the index from the key prior to storage reduces the memory consumption further from my original figures. It also speeds up the linear search part as he predicted. For the example million integers, the tie statement looks like this

    tie my %hash, 'Tie::CompactHash', 'A0 A4 A*';

    The 'A0' part indicates that there is no common prefix that can be factored out. The 'A4' indicates that 4 characters (in this case digits) are used as the primary index spreading the 1 million keys over 10,000 scalars (in a real hash). The final 'A*' indicating that the rest of the key is stored within the strings for linear searching. Also stored in the same string is an index into a second string indicating where the value for this key is stored. The values are stored using the 'n/A*' pack format, storing the length and value of the value. This allows both the keys and values to be of variable length, as is the overall size of the hash. A special number is used as the index to indicate that key has no value.

    For the 1 million 4x 7-bit character keys, it would look like this

    tie my %hash, 'Tie::CompactHash', 'A0 A2 A*';

    Here, again there is no common prefix, the index uses 2 characters mapping the million characters over 9216 (96²) strings, with the other two characters stored in the string along with the index into the values string.

    And a final example based on the post I cited in my original post. The problem involved hashing log filenames, which I speculated are often named along the lines of XYZ20031201.log.

    tie my %logs, 'Tie::CompactHash', 'A3 xx A4 A2 A*';

    In this case,

    1. 'A3' is a prefix common to all the log files. This is extracted from the first key passed and stored internally. For subsequent keys, this value is verified against the stored value, but is otherwise discarded.
    2. 'xx' skips over and discards the century. For most applications, ignoring this won't harm.
    3. 'A4' are the low 2 digits of the year, plus the month. For 10 years of logs, this maps to 120 strings.
    4. 'A2' is the day of month ensuring a maximum of 31 subkeys in each string.
    5. 'A*' is a common postfix which is discarded.

    This assumes that the filenames matched will be preselected (using glob or readdir etc) so that those parts ignored will not be significant. This is done to show the flexibility of the mechanism. Internally, the format passed on the tie is stored and then used to break up the key on input. This is efficient and flexible. For example, the exists function is implementated as

    sub EXISTS { my( $pre, $key, $post ) = unpack $mask{ $_[SELF] }, $_[KEY]; die "Prefix mismatch $pre ne $pre{ $_[SELF] }" unless $pre eq $pre{ $_[SELF] }; return defined $_[SELF]{ $key } and $_[SELF]{ $key }[ 0 ] =~ m[(...\c@)\Q$post\E\xff]s; }

    This use of perls standard hash mechanism to handle the sparse arrays problem and regexes for the final linear search, as both are built-in and heavily optimised means that once invoked, they run at the full speed of a C solution.

    I might try coding it in C or possibly assembler and see how it compares for performance, but the major limitation would still the tie interface. I've also tried coding my perl version using an OO interface to see what if any performance difference that made -- it was much of a muchness -- but the syntax of the interface was much less convenient than the tied version. I wish the tie mechanism performed better. It lends itself to so many useful extensions of perls built-in types, it's a shame the costs are so high.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail