mast has asked for the wisdom of the Perl Monks concerning the following question:

I am in the middle of a little script I'm writing that apparently needs more storage space than a standard tie'd hash can give me. This:
$myhashinfo = new DB_File::HASHINFO;
$myhashinfo->{bsize} = 65535;
unlink "revmap_check";
tie(%diskrevmap, "DB_File", "revmap_check", O_CREAT|O_RDWR, 0666, $myhashinfo)
    or die "Unable to build on-disk database tie. $!\n";
... doesn't give me enough storage space, even though {bsize} is as large as it can possibly be. Any larger (even by a single integer) and it refuses to tie at all. As it stands, my script generates the following errors:
HASH: Out of overflow pages.  Increase page size
It takes hours to run this script, so I have two questions:

1. Is my data going missing when those errors are generated?
2. What can I use that will give me a tie'd hash, and be able to store many GB of data in it?

Re: Giant Tie'd data structures
by BrowserUk (Patriarch) on Oct 26, 2005 at 01:52 UTC

    For you to be getting that error, you must be storing (and hashing) individual items that are longer than 64k each. The recommendation for the pagesize (bsize) parameter is to set it to 4x the size of your estimated biggest element (with lower/upper bounds of 512 bytes/64k).

    It's generally not a good idea to hash/index the entirety of entities that size. For most applications there is some obvious subset of each item that can be used as a key to the item. At worst, you could MD5 each item and use the digest as its key, store the items themselves separately (in individual files or in a fixed-record-length file), and use the hash to look up the file/record number and load the item from there.
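    A minimal sketch of that MD5-keyed approach, assuming the large items are written to individual files under a hypothetical items/ directory and the tied hash stores only the digest-to-filename mapping:

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);
        use DB_File;
        use Fcntl qw(O_CREAT O_RDWR);

        mkdir 'items' unless -d 'items';

        # The tied hash holds only small values: md5 digest => file name.
        tie my %index, 'DB_File', 'item_index.db', O_CREAT|O_RDWR, 0666, $DB_HASH
            or die "Cannot tie index: $!\n";

        sub store_item {
            my ($item) = @_;
            my $key  = md5_hex($item);        # 32-character hex digest as the key
            my $file = "items/$key.dat";
            open my $fh, '>', $file or die "Cannot write $file: $!\n";
            print {$fh} $item;
            close $fh;
            $index{$key} = $file;             # only the small path goes into the DB
            return $key;
        }

        sub fetch_item {
            my ($key) = @_;
            my $file = $index{$key} or return undef;
            open my $fh, '<', $file or die "Cannot read $file: $!\n";
            local $/;                         # slurp the whole file
            my $item = <$fh>;
            close $fh;
            return $item;
        }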

    Anyway, you might find this page useful.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you all for your kind words and your assistance. For more detail about what I'm trying to do:

      I have a large dataset of related files, stored in what is essentially a flat file. There is a "source" file, and a "destination" file, and each pair is listed on a single line. There may be many destination files for a single source file. (Read: the same source file may be listed more than once, but the destination files are all unique.)

      Some of these files have an undesirable characteristic: they are Unicode, or have some other, similarly odd filetype. I can measure that separately. The fact that a source file has an odd filetype taints all of its destination files as well.

      I am trying to generate a complete list of all the bad files, along with their related destination files whenever a bad file is the source of one of these related pairs.

      If I do two scans (one to build the list of bad files, then one to build the list of related files), the script takes 12 hours to run but completes the operation successfully. If I can reduce the two lengthy scans to one (by building a tie'd hash or btree that I can look through practically instantly), I can cut that to 6.
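      One way the single-scan version might be structured (a sketch only; the whitespace-separated pair file, the candidate list, and the is_bad() test are stand-ins for whatever the real script uses):

          use strict;
          use warnings;
          use DB_File;
          use Fcntl qw(O_CREAT O_RDWR);

          # Stand-ins for the parts of the real script not shown here.
          my @candidate_files = ();            # files to test for badness
          sub is_bad { return 0 }              # placeholder for the real filetype test

          # Map each source file to its destinations (joined with NUL bytes).
          tie my %dests_for, 'DB_File', 'revmap.btree', O_CREAT|O_RDWR, 0666, $DB_BTREE
              or die "Cannot tie: $!\n";

          open my $pairs, '<', 'pairs.txt' or die "Cannot open pairs.txt: $!\n";
          while (my $line = <$pairs>) {
              chomp $line;
              my ($src, $dst) = split ' ', $line, 2;
              $dests_for{$src} = defined $dests_for{$src}
                  ? "$dests_for{$src}\0$dst"
                  : $dst;
          }
          close $pairs;

          # Single pass: report every bad file, plus its destinations if it is a source.
          for my $file (@candidate_files) {
              next unless is_bad($file);
              print "$file\n";
              if (defined $dests_for{$file}) {
                  print "$_\n" for split /\0/, $dests_for{$file};
              }
          }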

      Anyway, thanks to the hints here, I switched to a btree tie which does the job no problemo. I will probably attempt to switch back to a hash now that I've learned I was tweaking the wrong tunable, but as long as I have a reasonably successful result, I'm a happy camper.

      Thank you all! :-)
Re: Giant Tie'd data structures
by merlyn (Sage) on Oct 25, 2005 at 21:47 UTC
    A tied hash is just one of many dozens of ways of doing this. Maybe you settled on the technology prematurely before considering the consequences. What are your real requirements? What will you be doing with the data? How are you accessing it? Maybe it's time for a real database (like DBD::SQLite).
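    For illustration, a DBD::SQLite-backed key/value store might look something like this (the file, table, and column names are made up for the example; note that doing bulk inserts inside a single transaction matters enormously for SQLite's speed):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect("dbi:SQLite:dbname=revmap.db", "", "",
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)");

        my $ins = $dbh->prepare("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)");

        # Committing once per row is what usually makes SQLite feel orders of
        # magnitude slower than it is; keep the bulk load in one transaction.
        $ins->execute("some key", "some value");
        $dbh->commit;

        # Retrieve a value by key.
        my ($v) = $dbh->selectrow_array(
            "SELECT v FROM kv WHERE k = ?", undef, "some key");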

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I need something that can store and retrieve items of data by specific keys (where the keys themselves carry important data).

      I've just tried SQLite and it appears to be orders of magnitude slower, but I'm sure I'm just doing something wrong.

      I know bdb can be used to store many GB of data, so which approach should I be using? A tie'd BTree, maybe?

      I'd love to educate myself about the storage limitations of the various tie methods. Doh! Help!
        SQLite is a lot slower than Berkeley DB. A couple of tips for you:

        The error message says to increase the page size. You are setting "bsize", which means block size; the page size parameter is called "psize".

        Also, you can tell it to use a BTree instead of a hash, and this is usually faster as well.
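        A minimal sketch of a BTree tie with the page size bumped up via a BTREEINFO object (the 64k page size and the 32MB cache are illustrative values to tune for your data):

            use strict;
            use warnings;
            use DB_File;
            use Fcntl qw(O_CREAT O_RDWR);

            my $btreeinfo = new DB_File::BTREEINFO;
            $btreeinfo->{psize}     = 65536;             # page size (illustrative)
            $btreeinfo->{cachesize} = 32 * 1024 * 1024;  # in-memory cache (illustrative)

            unlink "revmap_check";
            tie my %diskrevmap, "DB_File", "revmap_check", O_CREAT|O_RDWR, 0666, $btreeinfo
                or die "Unable to build on-disk database tie: $!\n";

            $diskrevmap{"some source file"} = "some destination file";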

Re: Giant Tie'd data structures
by perrin (Chancellor) on Oct 25, 2005 at 21:50 UTC
    DB_File, aka Berkeley DB, is commonly used to store terabytes of data. I think you're twiddling the wrong parameter here. Berkeley DB can handle a few GBs without breaking a sweat.
Re: Giant Tie'd data structures
by snowhare (Friar) on Oct 26, 2005 at 03:37 UTC

    It would help if you gave some hint of what you are actually doing and what platform you are using. If the problem is that you need more hash entries than DB_File permits or that DB_File won't handle files larger than 2GiB on your platform, then you might try Tie::DB_File::SplitHash.
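    If memory serves, Tie::DB_File::SplitHash takes the same arguments as a DB_File tie plus one extra for the number of files to split the hash across, so usage would look roughly like the following (check the module's documentation for the exact signature; the file name and split count here are just examples):

        use strict;
        use warnings;
        use DB_File;
        use Tie::DB_File::SplitHash;
        use Fcntl qw(O_CREAT O_RDWR);

        my $multi_n = 4;   # number of underlying DB files to spread the hash over
        tie my %hash, 'Tie::DB_File::SplitHash',
            'revmap_check', O_CREAT|O_RDWR, 0666, $DB_HASH, $multi_n
            or die "Unable to tie split hash: $!\n";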