Dear monks,

I need to store a huge amount of data with a fixed structure:

  • Each item has a unique (alphanumeric, 7-bit ASCII) id
  • A fixed number of "meta" information fields hold numbers or text data of up to 100 bytes (worst case, usually <30 bytes)
  • Meta information won't change once the item has been created
  • Each item has two text parts, usually 2-16k in size, sometimes several MB, but up to 2 GB has to be supported
  • The text parts are delivered in blocks of up to a predefined size limit (currently about 16 MB, but it could be changed to anything from ~1k upwards if the storage requires it); a typical block is currently 1900 bytes
  • The final size of a text part is not known in advance, and neither is the number of blocks
  • The blocks may not arrive in sequential order, but each carries a sequence number that starts from zero for each item; every sequence number is used
  • Up to 10 million items should be stored at the same time, maybe more in the future
  • About 90% of the items may be deleted some weeks after they were created
  • Some of the remaining items are deleted later; a few are kept forever
  • Each item must be accessible quickly by unique item id
  • Deletion of items may be really slow

    I considered using MongoDB, but it becomes slow at 15+ million items and has a 16 MB limit per item. MySQL can't handle this amount either. I'd like to store the data in files, but avoid one file per item, as that many files are hard for filesystems to handle.
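
To make the block-delivery requirements from the list above concrete, here is a minimal sketch of how the blocks of one text part could be collected when they arrive out of order. It is an illustration only, not a proposed storage engine: the class name `PartAssembler` and its methods are hypothetical, and everything is held in memory, which would not work for parts approaching 2 GB.

```python
from typing import Dict, List, Optional


class PartAssembler:
    """Collects the blocks of one text part (hypothetical helper, in-memory only)."""

    def __init__(self) -> None:
        # Maps sequence number -> block payload.
        self._blocks: Dict[int, bytes] = {}

    def add_block(self, seq: int, data: bytes) -> None:
        """Store one block; sequence numbers start at zero."""
        if seq < 0:
            raise ValueError("sequence numbers start at zero")
        self._blocks[seq] = data

    def missing(self) -> List[int]:
        """Sequence numbers below the highest one seen that have not arrived yet."""
        if not self._blocks:
            return []
        return [n for n in range(max(self._blocks)) if n not in self._blocks]

    def assemble(self) -> Optional[bytes]:
        """Return the blocks joined in sequence order, or None while gaps remain.

        Because the total number of blocks is unknown, a gap-free result only
        means "complete so far", not "final".
        """
        if self.missing():
            return None
        return b"".join(self._blocks[n] for n in sorted(self._blocks))
```

Since every sequence number is used, a gap below the highest number seen always means a block is still outstanding; what cannot be detected this way is whether more blocks will follow the highest one.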

    I considered tie and GDBM_File, which is rock solid for reading: I could store many items in one file, delete them, and append/insert text blocks as they arrive. But GDBM is problematic when more than one process writes to the same file, and I can't be sure that two processes will never write to the same file, since new text blocks keep arriving for different messages.
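
To make the multi-writer concern concrete, here is a minimal sketch of one way writers could be serialized: each process takes an exclusive advisory lock before appending a block to a shared file. This is not how GDBM works and not a recommendation. It assumes a Unix-like system (Python's `fcntl.flock`), that every writer cooperates by taking the lock, and a made-up record layout of an 8-byte header (sequence number, payload length) followed by the payload. The function names are hypothetical.

```python
import fcntl
import os
import struct

# Assumed record layout: big-endian sequence number and payload length.
HEADER = struct.Struct(">II")


def append_block(path, seq, data):
    """Append one block under an exclusive lock; return the record's file offset."""
    with open(path, "ab") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)      # blocks until no other writer holds the lock
        try:
            fh.seek(0, os.SEEK_END)         # another writer may have appended meanwhile
            offset = fh.tell()
            fh.write(HEADER.pack(seq, len(data)) + data)
            fh.flush()
            os.fsync(fh.fileno())           # push the record to disk before unlocking
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return offset


def read_blocks(path):
    """Yield (seq, data) pairs in the order they were written."""
    with open(path, "rb") as fh:
        while True:
            header = fh.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            seq, length = HEADER.unpack(header)
            yield seq, fh.read(length)
```

Blocks end up in arrival order, so a reader has to sort by sequence number. Deleting an item is not covered; it would need a separate compaction step, which fits the requirement that deletion may be slow.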

    Any suggestions?


    In reply to Store a huge amount of data on disk by Sewi
