comment on

I was recently trying to optimize memory consumption on a program(when running takes about 200mb ram).
The program uses a 3 key hash.
The keys of the hash are actually words.
And because they are words they repeat in all 3-key pairs
so I thought I should encode them in some way so that I save some space.
So I made a dictionary,first of all I needed a list which given a position would
give me the word,and a hash which given a word would give me the position on the list.
(The plan was for us to replace each 3-key pair of the hash described above with an encoding
of each of the keys,ofcourse,we are searching for a kind of compression so our encoding has
to be smaller than the actual word).
I made those two,but the number of words beeing very big(~28000) it turned out that the
average size of a word(in bytes) sometimes was less than it's "encoding" in an integer.
For example,let us suppose that we reached the word "flat" at iteration 27000(while constructing
the dictionary),then our encoding for "flat" in the dictionary would be 27000 and because
our hash will consider 27000 to be a string then we would actually use more memory for storing
it than the actual word( length("27000") > length("flat") ),so we end up saving nothing,but wasting more.
I think that here,the terrible disadvantage is that hashes don't give the possibility to have
integers(not the integer stored in a string) as keys.
I still wanted to optimize this so I thought,I should use something smaller than numbers(as strings)
so I started generating random strings of length 2 using all the ASCII table(256 characters)
so I had a space of 255*255=65536 keys(and this is convenient because this is more than twice the number of words
I was initially trying to encode).
I used instead of the list @uniq_words a hash %uniq_words and the keys to this hash were
the above mentioned 2-byte random generated strings.
Indeed this saved 2mb from the previous method(and this is clearly because of using keys
which are just 2 bytes compared to previously used keys of ~5 bytes).
But the program was 1mb fat than without any kind compression at all.
I believe there must be some threshold which I must pass before I will be able to see any
advantages that have not unveiled yet but I am not sure of this.
I would be interested in some data structures better suited for storing this ammount of data.
I have read some nice things about prefix-trees and I think I will implement them after all
using perl but perl always has this annoying overhead for any datastructure which is always there
and drags you down but I will not blame perl for beeing the program beeing a resource hog,instead
I will(for the moment) accept my lack of knowledge.

In reply to memory consumption skyhigh... by spx2

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.