Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re: A memory efficient hash, trading off speed - does it already exist?

by MarkM (Curate)
on Feb 04, 2003 at 00:04 UTC ( [id://232427]=note: print w/replies, xml ) Need Help??


in reply to A memory efficient hash, trading off speed - does it already exist?

Pre-allocation is not your problem. Consider that in a hash with 4 entries, the entire wastage is probably no more than 16 bytes per hash. 16 bytes X 120,000 words = 187.5 Kbytes. 187.5 Kbytes is nowhere near being a significant percentage of 70 Mbytes. What you are observing is the overhead required for each hash, each string, and even, each integer.

Definately, as other people have suggested, moving the data to disk will deal with your memory problem. However, it does not deal with the size of the data itself. Even the most space-efficient *DBM_File modules are not able to compact the data on disk more than a few small percentages tighter than Perl packs the data in memory, and there will be a significant performance hit.

I spent some time coming up with an alternative that stored the word=>score hash as a string of encoded 32-bit integer pairs (id,score), however, I finally decided that this is probably not a good approach unless one knows exactly what one is doing, and the best way to show that, is to implement it. I leave it to you to decide whether research in this area is desirable. Basically, you flatten the hash of hashes into 3 hashes of strings. The idscores string could encode the pairs in a manner that could be searched using a binary search, or perhaps a linear search is adequate. Ideas, ideas...

Replies are listed 'Best First'.
Re: Re: A memory efficient hash, trading off speed - does it already exist?
by JPaul (Hermit) on Feb 04, 2003 at 04:28 UTC
    Can you explain to me why the structures that store the data take over 10 times the size of the actual data itself? I believe you, I just don't understand why.
    What does storing the data in this format provide? Speedy STOREs and FETCHes? Is that it?

    I have a friend busily telling me that if this was C he could simply construct a linked-list struct and wander across the structs, which would undoubtedly take up less memory. While I'm sure thats not as magically efficient as he thinks, and I know the lookups would be phenomenally slow, its hard to convince him :)

    Thanks,
    JP
    -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

      The features that Perl data structures provide over minimal C data structures include at least the following:

      1. Reference counting to ensure prompt and reliable garbage collection. Once the reference count hits 0, the structure can be reclaimed.
      2. Data structures provide storage for multiple form values. For instance, if a string is used as an integer, and as a number, further accesses of the string as either a string, an integer or a number do not require for the string to be converted each time.
      3. In-place upgrading of types. For instance, if a type is blessed, or if a hash is tied, or if an integer becomes a string, all references (active references, not the same as Perl scalar references) remain valid. For comparison, consider the C linked list your friend mentions. Now change the list to a b-tree. Will all data structures that referred to the list now be ok referring to the b-tree? What if the b-tree data structure is bigger than the list data structure and requires being moved?

      Another thing to explain to your friend is that the HASH structure is a form of indexed linked list. First, the key is hashed into an integer form, then the integer is used to produce an index into the array of linked lists.

      The overhead for a standard Perl string is at least 24 bytes (on a 32-bit architecture). This is each of: a reference count (4 bytes), flags (4 bytes), a reference to a more specific data structure (4 bytes), a reference to the string content (4 bytes), the length of the string (4 bytes), the space allocated for the string as it is likely less than the length (4 bytes).

      The overhead for a standard Perl hash is at least: (again, for a 32-bit architecture)

      60 bytes + ((20 + length of average key) * number_of_keys)) + (4 * size of array used within hash) + (24 * length of average value, assuming all string values)

      When I referred to the amount you would save for not pre-allocating hash buckets, I was referring to "4 * size of array" number above, which is almost insignificant compared to the rest of the overhead. The above only describes a hash mapping strings to strings. In your case, you are mapping a string to a scalar reference to a hash that maps strings to numbers.

      Your specific case is pushing for title of 'application that best makes Perl look like a pig.' Usually, Perl is able to achieve significant performance gains by using more memory. In your case, the sheer size of the data structures likely results in the opposite effect.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://232427]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-29 13:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found