in reply to Re: Hash entries starting with a given string.
in thread Hash entries starting with a given string.

I don’t know how much Perl internally uses for representing a hash. But let’s take a small value like 10 bytes for each hash entry (key value pair).

For a 5-byte key and a simple integer value, nearly 48 bytes per pair:

c:\test>p1 use Devel::Size qw[ total_size size ];; $h{ pack 'C*', map int rand 256, 1 .. 5 }=1 for 1 .. 1e6;; print scalar keys %h;; 1000000 print total_size( \%h ) / 1e6;; 47.194364
Then this already results in approximately 10 TB’s (2^40 / 1000^4). Clearly a Perl hash is not the way forward.

There is room for ambiguity in interpreting the OPs post. He said: (potentially 2^40 entries if I'm unlucky)

I'm assuming that by "ascii" he's talking 8-bit chars. And when he suggested 2^40, (256^5. Ie. 5(8-bit)byte keys) was possible, I'm guessing that this is probably a (very) sparsely populated range.

BrowserUK suggested to use the file system. I have implemented something similar in the past. The size was “only” one TB and it worked fine. But there was a difference; the tree structure was very simple in my case, only a few levels deep and the files stored where relatively big (tiff images). Storing many small files in a deeper tree.... I don’t know.

Well. If he ever comes back and responds to the suggestion, then I suggest a better mechanism that uses 3 level of directory and a single 512k file (256*256*4) to represent the final 2 dimensions. By reducing the depth of the directory structure from 4 to 3, will reduce the number of directories involved from (max) 4 billion to 16 million. And the number (max) files involved from 1 trillion, to 4 billion.

In addition, instead of each of those files using 4k to represent a single number, the combined files will use just 4-bytes. And then you make those combined files sparse, and whole 16k ranges (eg. ...aa through ...ay) will be represented by just 4 bytes if they are not used.

Assuming 32-bit counts and a 10% loading factor, the minimum 4-terabyte storage requirement (fully populated, 32-bit counts) might be reduced to 125GB. And if the possibilities for duplicates allows 2-byte or 1 byte counts, cut that in half or half again.

But the truth is, I think this question is probably similar to the guy a year or two ago that wanted to generate all the possible 1024x768 24-bit images and then scan through them looking for the ones that showed starlets in interesting poses.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^3: Hash entries starting with a given string.
by dHarry (Abbot) on Jul 28, 2008 at 09:11 UTC

    Interesting solution and food for thought. We need to know more about the problem (as usual).