Big suggestion. Your first step should be to take your data and do a mergesort to get it into order. Mergesorts sequentially process data in order, and therefore represent a best case for standard pre-fetch/caching algorithms used in filesystems and on disk. Inserting sorted data into a BTree again causes filesystem caching to do you a world of good.

After that I'd expect your access speed to be better than you're planning on. Smart BTree implementations don't have a constant branch factor - instead they put as many branches into a page as they can. Even with the amount of data that you're facing, I think that a BTree should require no more than 5 page accesses to pull your data. And better yet, the root pages are usually going to be in memory so you'll probably average under 2 hits to disk for data. (Plus if there is some locality of reference in how it is used, then it could be less.)

As for scaling, you can get a lot out of a bit of parallelism. The fact that one request needs to wait disk is no reason that another request cannot be being served as well. One good architecture for that looks like a dedicated server with several threads or processes - you connect to one of them and it serves you. Unbeknownst to you, others are doing the same. (I'd strongly recommend against using BerkeleyDB in a CGI environment because of problems with server start/stop. In a mod_perl environment is fine, but in a CGI environment there are potential issues that you don't want to hit.)

Of course if you're going to talk about that, then you might as well upgrade to the invented wheel of a decent RDBMS with row-level locking. Modifying this recipe to an RDBMS, create your table, and create an index on it that will internally be a BTree. (How you do that varies by database - see a DBA and/or documentation.) Then take your dataset, do the mergesort on your own, and proceed to insert it into that table in order.

If you've lined things up right, the load should proceed at reasonable speed, and when you're done the database has already figured out how to coordinate having multiple processes making requests of it at once.


In reply to Re^3: size on disk of tied hashes by tilly
in thread size on disk of tied hashes by danderson

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.