Whenever you're considering a design, remember the following. A seek to disk costs about 0.01 seconds, but it is reasonable to stream data back and forth at 50 MB/s. In the time one seek takes you could have streamed about half a megabyte. Therefore it is worth doing a lot of extra work to be able to stream data rather than seeking to disk.
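To put rough numbers on that, here is a minimal back-of-envelope sketch in Python. The two constants are the figures quoted above; the function names and the 10 GB example are hypothetical, chosen only for illustration, and real hardware will vary.

```python
# Back-of-envelope cost model using the figures quoted in the post.
# Both constants are rough rules of thumb, not measurements.
SEEK_SECONDS = 0.01            # assumed cost of one disk seek
STREAM_BYTES_PER_SEC = 50e6    # assumed streaming rate, 50 MB/s


def seek_time(num_seeks):
    """Seconds spent if every access requires its own seek."""
    return num_seeks * SEEK_SECONDS


def stream_time(num_bytes):
    """Seconds spent reading num_bytes sequentially."""
    return num_bytes / STREAM_BYTES_PER_SEC


def break_even_bytes():
    """Bytes you could have streamed in the time one seek takes."""
    return SEEK_SECONDS * STREAM_BYTES_PER_SEC


if __name__ == "__main__":
    # Hypothetical data set: 100 million records of 100 bytes (10 GB).
    records = 100_000_000
    record_bytes = 100
    print("one seek per record:", seek_time(records), "seconds")
    print("one streaming pass: ", stream_time(records * record_bytes), "seconds")
    print("break-even per seek:", break_even_bytes(), "bytes")
```

Under these assumptions one streaming pass over the 10 GB takes about 200 seconds, while seeking once per record takes about a million seconds (more than 11 days). That gap is why even several extra streaming passes are cheap compared with random access.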

If you've done things right, for large data sets your time is entirely dominated by the time to stream through the data, so 1 file vs 100 files is irrelevant. But splitting directly into 100 files may be a horrible idea, for the simple reason that disk drives are typically able to stream data at high rates to only a small, fixed number of locations at once, like 4 or 16. So you'd probably want to split the data in multiple passes if you went with this design. (That is not to say that this is the right design. Personally I head in the merge sort direction rather than using hashing.)
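A multi-pass split of that kind could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the function names, the fan-out of 10 per pass, the use of the whole line as the hash key, and CRC32 as the hash. It reaches 100 buckets in two passes while never having more than 10 output files open at once.

```python
import os
import zlib


def split_once(src_path, out_dir, fan_out, digit):
    """Stream src_path once, writing each line to one of fan_out files.

    The bucket is the given base-fan_out "digit" of the line's CRC32,
    so successive passes with digit 0, 1, ... refine the partition.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, "part%d" % i) for i in range(fan_out)]
    outs = [open(p, "wb") for p in paths]
    try:
        with open(src_path, "rb") as src:
            for line in src:
                h = zlib.crc32(line)  # hypothetical key: the whole line
                outs[(h // fan_out ** digit) % fan_out].write(line)
    finally:
        for f in outs:
            f.close()
    return paths


def split_two_pass(src_path, work_dir, fan_out=10):
    """Split into fan_out * fan_out buckets using two streaming passes."""
    final_paths = []
    first = split_once(src_path, os.path.join(work_dir, "pass1"), fan_out, 0)
    for i, path in enumerate(first):
        sub_dir = os.path.join(work_dir, "pass2_%d" % i)
        final_paths.extend(split_once(path, sub_dir, fan_out, 1))
    return final_paths
```

Calling `split_two_pass("big.txt", "work")` (both names hypothetical) reads the data twice in total, which under the streaming figures above is still far cheaper than making the disk hop between 100 write locations. The merge sort approach has the same shape: sorted runs are merged a limited number at a time, over several passes.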

As for going to a database, my experience is that when your data sets are near the capacity of the machine, databases will often run into resource constraints and not figure a way out. It isn't that the query runs painfully slowly; it is that it grinds away for several hours and then crashes. That is one of the prime reasons that I have needed to do end runs around the database when working with large data sets.


In reply to Re^7: Working with large amount of data by tilly
in thread Working with large amount of data by just1fix
