in reply to Hash entries starting with a given string.

Recently we ended up with a matrix of 4 TB (calibration information for calibrating infrared images). In theory it would have fit on the server but doing matrix operations on a matrix of that size is asking for trouble. So we had to retrace are steps, give it another thought and we could reduce the size dramatically. Turned out that only a fraction of the data was really what we needed.

In your case: 2^40 = 1099511627776.

So this would result in about 10^12 hash entries. I assume you need to do lots of searches in the data. (Why else the hash?)

Now I don’t know how much Perl internally uses for representing a hash. But let’s take a small value like 10 bytes for each hash entry (key value pair). (It’s probably more!). Then this already results in approximately 10 TB’s (2^40 / 1000^4). Clearly a Perl hash is not the way forward.

BrowserUK suggested to use the file system. I have implemented something similar in the past. The size was “only” one TB and it worked fine. But there was a difference; the tree structure was very simple in my case, only a few levels deep and the files stored where relatively big (tiff images). Storing many small files in a deeper tree.... I don’t know.

The best approach to me seems the VLDB approach as suggested by swampyankee. Because that’s basically what you have/want: a VERY LARGE DB.

It's probably a good idea to rethink your problem.

  • Comment on Re: Hash entries starting with a given string.

Replies are listed 'Best First'.
Re^2: Hash entries starting with a given string.
by BrowserUk (Patriarch) on Jul 28, 2008 at 08:46 UTC
    I don’t know how much Perl internally uses for representing a hash. But let’s take a small value like 10 bytes for each hash entry (key value pair).

    For a 5-byte key and a simple integer value, nearly 48 bytes per pair:

    c:\test>p1 use Devel::Size qw[ total_size size ];; $h{ pack 'C*', map int rand 256, 1 .. 5 }=1 for 1 .. 1e6;; print scalar keys %h;; 1000000 print total_size( \%h ) / 1e6;; 47.194364
    Then this already results in approximately 10 TB’s (2^40 / 1000^4). Clearly a Perl hash is not the way forward.

    There is room for ambiguity in interpreting the OPs post. He said: (potentially 2^40 entries if I'm unlucky)

    I'm assuming that by "ascii" he's talking 8-bit chars. And when he suggested 2^40, (256^5. Ie. 5(8-bit)byte keys) was possible, I'm guessing that this is probably a (very) sparsely populated range.

    BrowserUK suggested to use the file system. I have implemented something similar in the past. The size was “only” one TB and it worked fine. But there was a difference; the tree structure was very simple in my case, only a few levels deep and the files stored where relatively big (tiff images). Storing many small files in a deeper tree.... I don’t know.

    Well. If he ever comes back and responds to the suggestion, then I suggest a better mechanism that uses 3 level of directory and a single 512k file (256*256*4) to represent the final 2 dimensions. By reducing the depth of the directory structure from 4 to 3, will reduce the number of directories involved from (max) 4 billion to 16 million. And the number (max) files involved from 1 trillion, to 4 billion.

    In addition, instead of each of those files using 4k to represent a single number, the combined files will use just 4-bytes. And then you make those combined files sparse, and whole 16k ranges (eg. ...aa through ...ay) will be represented by just 4 bytes if they are not used.

    Assuming 32-bit counts and a 10% loading factor, the minimum 4-terabyte storage requirement (fully populated, 32-bit counts) might be reduced to 125GB. And if the possibilities for duplicates allows 2-byte or 1 byte counts, cut that in half or half again.

    But the truth is, I think this question is probably similar to the guy a year or two ago that wanted to generate all the possible 1024x768 24-bit images and then scan through them looking for the ones that showed starlets in interesting poses.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Interesting solution and food for thought. We need to know more about the problem (as usual).