in reply to Giant Tie'd data structures

For you to be getting that error, you must be storing (and hashing) individual items that are longer than 64k each. The recommendation for the pagesize (bsize) parameter is to set it to 4x the size of your estimated biggest element (with lower/upper bounds of 512 bytes/64k).
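For example, with DB_File the page size can be passed in through $DB_HASH before tying. A minimal sketch, assuming an estimated largest item of around 8k (the file name and size estimate are placeholders):

    use DB_File;
    use Fcntl;

    # Assume the largest value stored is ~8k, so set the page size to
    # roughly 4x that (32k), which stays inside the 512 byte .. 64k bounds.
    $DB_HASH->{'bsize'} = 32_768;

    my %h;
    tie %h, 'DB_File', 'data.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie data.db: $!";

    $h{'some key'} = 'some value';

    untie %h;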

It's generally not a good idea to hash/index the entirety of entities that size. For most applications there is some obvious subset of each item that can be used as its key. At worst, you could MD5 each item and use the digest as its key, store the items themselves separately (in individual files, or in a fixed-record-length file), and use the hash to look up the file/record number and load the item from there.
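A rough sketch of that digest-as-key idea, using Digest::MD5 and one file per item (the file names and layout here are just assumptions for illustration):

    use DB_File;
    use Digest::MD5 qw(md5_hex);
    use Fcntl;

    # Key the tied hash on an MD5 digest of each large item and store only a
    # small locator (here, a file path) as the value; the large items live
    # outside the DB file entirely.
    my %index;
    tie %index, 'DB_File', 'index.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie index.db: $!";

    sub store_item {
        my ($item) = @_;
        my $digest = md5_hex( $item );      # 32-char hex key, far below 64k
        my $path   = "items/$digest.dat";   # one file per item (assumed layout)
        open my $fh, '>', $path or die "Cannot write $path: $!";
        print {$fh} $item;
        close $fh;
        $index{$digest} = $path;            # the hash stores only the locator
        return $digest;
    }

    sub fetch_item {
        my ($digest) = @_;
        my $path = $index{$digest} or return;
        open my $fh, '<', $path or die "Cannot read $path: $!";
        local $/;                           # slurp mode
        return scalar <$fh>;
    }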

Anyway, you might find this page useful.



Re^2: Giant Tie'd data structures
by mast (Novice) on Oct 26, 2005 at 16:48 UTC
    Thank you all for your kind words and your assistance. Here's more detail about what I'm trying to do:

    I have a large dataset of related files, stored in what is essentially a flat file. Each line lists a pair: a "source" file and a "destination" file. There may be many destination files for a single source file. (Read: the same source file may be listed more than once, but the destination files are all unique.)

    Some of these files have an undesirable characteristic: they are Unicode, or some other similarly odd filetype. I can measure that separately. The fact that a source file has an odd filetype taints all of its destination files as well.

    I am trying to generate a complete list of all the bad files, along with their related destination files whenever a bad file is the source of one of these pairs.

    If I do two scans (one to build the list of bad files, and then one to build the list of related files), the script will take 12 hours to run, but the operation will complete successfully. If I can reduce those lengthy scans to just one (by building a tie'd hash or btree that I can scan through practically instantly), I can reduce that to 6.
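    A minimal sketch of that single-pass idea, assuming the bad-file list has already been built into a DB_File BTREE tie (e.g. during the separate filetype measurement) and that each line holds a whitespace-separated source/destination pair (the file names and format are assumptions):

        use DB_File;
        use Fcntl;

        # Keep the bad file names in an on-disk BTREE so the lookup set need
        # not fit in memory, then make one pass over the pairs file and flag
        # every destination whose source is bad.
        my %bad;
        tie %bad, 'DB_File', 'bad_files.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
            or die "Cannot tie bad_files.db: $!";

        open my $pairs, '<', 'pairs.txt' or die "Cannot open pairs.txt: $!";
        while ( my $line = <$pairs> ) {
            chomp $line;
            my ( $src, $dst ) = split ' ', $line, 2;
            print "$dst\n" if exists $bad{$src};   # destination tainted by its source
        }
        close $pairs;

        untie %bad;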

    Anyway, thanks to the hints here, I switched to a btree tie, which does the job no problemo. I will probably attempt to switch back to a hash now that I've learned I was tweaking the wrong tunable, but as long as I have a reasonably successful result, I'm a happy camper.

    Thank you all! :-)