Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

A more memory efficient storage structure?

by JPaul (Hermit)
on Dec 31, 2002 at 16:04 UTC ( [id://223359]=perlquestion: print w/replies, xml ) Need Help??

JPaul has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all;
I have a piece of code I've been fiddling around with thats designed to emulate natural speech, learning from users input. (Very simply, a learning chatterbox).

I've been surprised by how much memory the data takes up, given how small it is when written to disk. I use twin hashes, storing practically the same data, but in a different order. The script learns a sentence in two directions (front to back, back to front) so it can generate a sentence in either direction from a given keyword.
Right now each hash, on disk, takes up 727k (1.4M "brain") - but when loaded into the hash, takes up a remarkable 16M! (I've loaded the software without data to verify).
My hash is put together like so:

$VAR1 = { 'Word1_Word2' => { 'Sym1' => 3, 'Sym2' => 1 }, 'Word3_Word4' => { 'Sym4' => 3, 'Sym3' => 1 }, 'Word5_Word6' => { 'Sym5' => 1 } };
For comparison, I write every entry to disk in the format:
Word1 \a Word2 \00 Sym1 \00 3 \n
Can you fine gentlemonks suggest a better way of storing data in memory, while also being easy to reference?
My thanks,

JP,
-- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

Replies are listed 'Best First'.
Re: A more memory efficient storage structure?
by pg (Canon) on Dec 31, 2002 at 18:16 UTC
    Let's look at this from Perl internal view, to understand why it uses more memory than you expected.

    When you insert a key-value pair, Perl would use its hash function to calculate a non-unique internal key (the internal key, which hash really used for indexing). Focus on the word "non-unique", which means there would be other hash elements sharing the same internal key.

    Now perl need to allocate memory for this key-value pair. It can just allocate enough memory for this pair, but that would be really slow, as Perl has to allocate/reallocate memory each time when a new key-value pair is inserted. Performance vs. memory...

    What Perl really did is to allocate a chunk of memory to store the pairS related to this internal key.

    You can imaging that there would be lots of "waste" (in terms of memory, not speed) at the beginning, when there are not many key collisions. But the situation will get progressively improved, as more key collisions happen. When there is a key collision, there is a big chance that perl doesn't need to allocate memory for this internal key, as there is still space left to store pairS for this internal key, so instead of allocating new chunk of memory, it just fills up whatever allocated already (this means the actual usage of allocated memory is getting improved. Of course you would have to allocate another chunk, after the first chunk is used up. On the other hand, having too many key collisions is not a good news to the performance.)

    In your case, your data is not big, so we can reasonablly expect there are not many key collisions, and lots of allocated memory are wasted. 16 times is not a surprise, especially when you are using a two-level hash, at the second level, you actually have many hashes. Even worse, if your second level hashes are all those kind of small ones, having 1 or 2 elements, there would be a big waste.
      At some point I need to put together a "How Perl variables use memory" document, I suppose. That might be useful, though less often than one might think.
Re: A more memory efficient storage structure?
by Elian (Parson) on Dec 31, 2002 at 16:25 UTC
    Sounds like an on-disk database is the way to go here. Besides the advantage of persistence, you'll also probably use less memory as only the parts of the DB you use will be brought into memory. It would also mean you could share the database across multiple programs, if you were into that sort of thing. A Berkeley DB file would likely do what you need right now.

    Depending on what you intend for this tool, you might want to abstract the interface to it some, so at a later point you could add more features to teh DB back-end--multiple indices, synonyms, or stuff like that.

Re: A more memory efficient storage structure?
by traveler (Parson) on Dec 31, 2002 at 16:38 UTC
    You might want to look at this node: How to Calculate Memory Needs. There are also some good points in this node. You might try also adding small amounts of data to the amount you've benchmarked: it may turn out that the memory manager on your platform has allocated memory and your program is not using it, yet.

    HTH, --traveler

Re: A more memory efficient storage structure?
by waswas-fng (Curate) on Dec 31, 2002 at 16:15 UTC
    How do you need to access it? Sometimes you can trade speed for memory by using arrays instead of hashes, also you can tie hashes to dbm files etc greatly reducing the amount of memory you use -- another added advantage here is that your data is saved across multiples runs automagically. check out AnyDBM ans tie

    -Waswas
      Greetings;

      Indeed, I never mentioned this point but speed comes first.
      I'm willing to use enormous amounts of memory, right now, because it does respond quickly enough (Generating some 200 sentences in 1 and 1/2 seconds), but if it was any slower than that people using it would probably get frustrated with it.
      This, of course, limits largely what can be done - but I'm still curious if there are solutions to be found.

      JP,
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        You might want to read perlguts, then. :) If speed comes first, be prepared to burn lots of memory. Speed does not rule out a DB-backed solution, however, as they all do caching and will be rather fast (though probably not as fast as all-memory hashes) once the caches are populated.

        Consider using Devel::Size as you try different things to get an accurate (as long as you don't have coderefs in your hashes) reading of memory used by the different structures.

Re: A more memory efficient storage structure?
by JPaul (Hermit) on Dec 31, 2002 at 17:49 UTC
    Hello again;
    I grabbed Devel::Size, and according to it, each of the hashes is 6.1M in size.
    'top' shows the memory usage at 18M, and I don't think the interface is using 6M of space... So I'm not entirely convinced thats quite correct. But it stills gives an idea of the inflation - 727k to 6.1M.
    I've never been able to make BerkeleyDB install correctly (Out of two tries, I believe), so I'm not terribly interested in it out of previous predjudices, but I could certainly entertain using an ondisk DB... I can't imagine it'd be terribly quick for the lookups, but I'll never know until I try.
    (My workstation runs Linux, if that tells anyone anything)

    JP,
    -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

      Do be aware that Devel::Size does count shared data in part of its size calculation when figuring out how big what you're checking is, but it only counts it once per call, so if the two hashes share some data, and they might, pass them in together for a correct aggregate total. Something like:
      print total_size([\%hash1, \%hash2]);
      though that'll add in a few dozen bytes for the two refs and the anon array.

      Also, don't rule out the possibility that there's a lot of scrap memory hanging around that Perl's not cleaned up after quite yet, or that you've not noticed.

      You only need BerkeleyDB if you want the newer features (transactions,record locking,etc). Perl ships with BerkeleyDB, look in the directory ext/DB_File for the version.

      Getting newer versions to install was a pain in the a**, the problem it turned out for me was which db.h file was being included in the compilation, so if you want the newer features, that's a good place to start looking for problems.

      -Lee

      "To be civilized is to deny one's nature."
      SDBM_File is actually much faster than DB_File or BerkeleyDB when dealing with a single process, and it comes standard with Perl on most systems. However, if speed is the most important thing, you should just stick with what you have. The memory inflation is not unusual.
Re: A more memory efficient storage structure?
by runrig (Abbot) on Dec 31, 2002 at 23:35 UTC
    All those one or two element nested hashes are wasting alot of memory (I assume there's alot of them). Is there any reason you couldn't concatenate the 'Sym#' onto the main hash key? You might save 40-50% in memory.
Re: A more memory efficient storage structure?
by Anonymous Monk on Jan 02, 2003 at 17:13 UTC
    The obvious lazy-caveman's solution would be to simply take the data that you write to the hard disk(flatten the hash )and write it to a variable :->, I bet that you could access it with regexes very fast! But...since it looks like youre not having a caveman situation:

    From your example it looks like you use numbers quite a bit and hashes are just horrible for numbers, I suggest you use more arrays. And make sure that you dont have unneccessary variable names (the names themselves do take up a lot of space...) I also agree 100% with runrig.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://223359]
Approved by krujos
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2024-03-28 17:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found