spx2 has asked for the wisdom of the Perl Monks concerning the following question:

I was recently trying to optimize the memory consumption of a program (it uses about 200 MB of RAM when running). The program uses a 3-key hash, and the keys are actually words. Because they are words, they repeat across the 3-key tuples, so I thought I should encode them in some way to save space.

So I built a dictionary: a list which, given a position, returns the word, and a hash which, given a word, returns its position in the list. (The plan was to replace each key of the 3-key hash described above with its encoding; of course, since we are after a kind of compression, the encoding has to be smaller than the actual word.)

I built both, but with the number of words being quite large (~28,000), it turned out that the average size of a word (in bytes) was sometimes smaller than its "encoding" as an integer. For example, suppose we reach the word "flat" at iteration 27,000 while constructing the dictionary; its encoding would then be 27000, and because the hash treats 27000 as a string, storing it actually takes more memory than the word itself ( length("27000") > length("flat") ). So we end up saving nothing and wasting more.
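Roughly, the dictionary part looks like this (a minimal sketch with placeholder data and variable names, not my actual code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @id2word;    # position -> word
    my %word2id;    # word -> position in @id2word

    # pretend this held the ~28,000 distinct words
    my @words = qw(flat house tree);

    for my $w (@words) {
        next if exists $word2id{$w};
        push @id2word, $w;
        $word2id{$w} = $#id2word;
    }

    # The "encoding" of a word is its index, but used as a hash key the
    # index is stored as a string, so a late entry such as 27000 becomes
    # the 5-character string "27000" -- longer than "flat" itself.
    my $code = $word2id{flat};
    print "flat -> $code\n";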
I think the terrible disadvantage here is that hashes don't give you the possibility of having real integers as keys (the integer is always stored as a string).

I still wanted to optimize this, so I thought I should use something smaller than numbers-as-strings. I started generating random strings of length 2 over the full 256-character byte range, which gives a space of 256*256 = 65,536 keys (convenient, because that is more than twice the number of words I was trying to encode).

Instead of the list @uniq_words I used a hash %uniq_words, and the keys of this hash were the randomly generated 2-byte strings mentioned above. Indeed this saved 2 MB compared to the previous method (clearly because the keys are just 2 bytes instead of the ~5 bytes used before), but the program was still 1 MB fatter than with no compression at all.
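For what it's worth, fixed 2-byte codes can also be produced deterministically rather than randomly, e.g. with pack (again just a sketch, not what I actually ran):

    use strict;
    use warnings;

    # map a dictionary index to a fixed 2-byte string and back
    my $index = 27000;              # any index below 65,536 fits
    my $code  = pack 'n', $index;   # 2-byte big-endian unsigned short
    my $back  = unpack 'n', $code;  # 27000 again

    printf "code is %d bytes, decodes to %d\n", length($code), $back;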
I believe there must be some threshold I have to pass before the advantages show up, but I am not sure of this.

I would be interested in data structures better suited for storing this amount of data. I have read some nice things about prefix trees and I think I will implement one in Perl after all, even though Perl always has this annoying per-structure overhead that drags you down. I will not blame Perl for the program being a resource hog, though; for the moment I will accept my own lack of knowledge.
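A prefix tree can be sketched in Perl with nested hashes, one level per character (just an illustration of the idea, not something I have measured):

    use strict;
    use warnings;

    my %trie;

    sub trie_add {
        my ($word) = @_;
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $word;
        $node->{-end} = 1;                 # mark the end of a word
    }

    sub trie_has {
        my ($word) = @_;
        my $node = \%trie;
        for my $ch (split //, $word) {
            return 0 unless $node = $node->{$ch};
        }
        return exists $node->{-end};
    }

    trie_add('flat');
    print trie_has('flat') ? "found\n" : "missing\n";

Note that with one Perl hash per node the per-structure overhead applies to every character, so a plain nested-hash trie can easily use more memory than a flat hash, not less.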

Re: memory consumption skyhigh...
by PetaMem (Priest) on Mar 31, 2008 at 05:49 UTC
    In general: Your problem might not be the keys, but the values of your hash.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Size qw(size);

    my %hash = ();

    for my $key (1 .. 27000) {
        $hash{"blahblahblahblah$key"} = 1;
    }

    print size(\%hash);
    The keys here are really not small, and they also contain a lot of redundant data, yet Devel::Size reports only 1,736,142 bytes for the hash, which has 27,000 entries like yours. I think you are optimizing the wrong part of your data structure.
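    To make the point about values concrete (a variation on the snippet above, not taken from the original reply): if each value is a reference to a structure instead of a constant, the total size grows far beyond what the keys alone account for.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Size qw(total_size);

    my %hash = ();

    for my $key (1 .. 27000) {
        # a small array reference per value instead of the constant 1
        $hash{"blahblahblahblah$key"} = [ 0, 0, 0 ];
    }

    print total_size(\%hash);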

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      Example of what I am trying to do, and of the data I am handling: in generate.pl, freq_tuple3 is the 3-key hash I was talking about.
Re: memory consumption skyhigh...
by Zen (Deacon) on Mar 31, 2008 at 14:55 UTC
    Use a database if you'd like to work with large datasets. True for any language.
Re: memory consumption skyhigh...
by TGI (Parson) on Mar 31, 2008 at 19:22 UTC

    Your data set could include up to 21,952,000,000,000 points if fully populated (28,000^3). Even though you are working with a sparsely populated version of that possible set, the number is still going to be huge. If we assume that each word combines with only 1/5 of the available words, that still leaves (28,000/5)^3 = 175,616,000,000 data points to count. You are suffering from a combinatorial explosion.

    Zen recommended using a database. I think that is good advice. MLDBM looks like a nice fit.
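    A minimal sketch of what a tied MLDBM hash might look like (the file name is made up, and the fetch/modify/store step is needed because MLDBM cannot update nested values in place):

    use strict;
    use warnings;
    use Fcntl;                        # O_CREAT, O_RDWR
    use MLDBM qw(DB_File Storable);   # DB_File backend, Storable serializer

    tie my %freq, 'MLDBM', 'freq_tuple3.db', O_CREAT | O_RDWR, 0640
        or die "Cannot open freq_tuple3.db: $!";

    # MLDBM caveat: fetch the top-level value, modify it, store it back
    my $tmp = $freq{'word1'} || {};
    $tmp->{'word2'}{'word3'}++;
    $freq{'word1'} = $tmp;

    untie %freq;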

    Update: Calculations based on 28000 words, not 27000. Credited Zen by name for his advice.


    TGI says moo

Re: memory consumption skyhigh...
by mkirank (Chaplain) on Mar 31, 2008 at 22:02 UTC