Are you opening thousands of file handles at once in order to sort the contents of a single large input into thousands of distinct output buckets? If so, is the input an existing file on disk, or is it a "live" streaming source that has to be sorted on a continuous basis?

For splitting up the contents of a large, complicated input file, I'd make one pass to build an index of the records to be sorted. The output of this pass is a stream of lines, one per input record, each containing "bucket_name start_offset byte_length". Then I'd sort the index by bucket_name, and use a second-pass script that does a "seek(...); read(...)" on the big file for each line in the sorted index. Because of the sorting, all the records intended for a given bucket are clustered together, so only one output file needs to be open at a time. On the whole, I'd expect this to run a lot faster than juggling thousands of open handles, because there is less file I/O and system overhead.
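Here is a rough sketch of that two-pass idea, written in Python only for brevity (Perl's seek and read would do the same job). The names bucket_of, build_index and split_by_index are made up for illustration, and it assumes one record per line, with the first whitespace-separated field naming the bucket and being safe to use as a file name.

```python
import os

def bucket_of(record: bytes) -> str:
    # Hypothetical rule: the first whitespace-separated field names the bucket.
    return record.split(None, 1)[0].decode()

def build_index(big_path):
    """Pass 1: collect (bucket_name, start_offset, byte_length) per record."""
    index = []
    offset = 0
    with open(big_path, "rb") as big:
        for record in big:  # assumes one record per line
            if record.strip():  # skip blank lines, but still count their bytes
                index.append((bucket_of(record), offset, len(record)))
            offset += len(record)
    return index

def split_by_index(big_path, index, out_dir):
    """Pass 2: walk the sorted index with one output file open at a time."""
    current, out = None, None
    with open(big_path, "rb") as big:
        # Sorting the tuples groups records by bucket, then by input offset.
        for bucket, start, length in sorted(index):
            if bucket != current:
                if out is not None:
                    out.close()
                out = open(os.path.join(out_dir, bucket), "wb")
                current = bucket
            big.seek(start)
            out.write(big.read(length))
    if out is not None:
        out.close()
```

This keeps the index in memory; for a really big input you would write the index lines to a file and sort that with an external sort, as described above.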

If you're dealing with a continuous input stream, where two passes over the data might not be practical (and the number and names of potential output buckets might not be known in advance), I'd probably switch to storing the records in a database instead of in lots of different files -- a flat table in MySQL, Oracle, or whatever, with fields "bucket_name" and "record_value", might suffice, provided you build an index on the bucket_name field to speed up retrieval based on that field.
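As a minimal sketch of the flat-table idea, here is a Python version using SQLite, chosen only because it needs no server (it stands in for the "whatever" above). The table name records, the index name idx_bucket, and the helper functions are all hypothetical.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the flat table and the index on bucket_name if they are missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "bucket_name TEXT NOT NULL, record_value TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_bucket ON records (bucket_name)"
    )
    return conn

def store(conn, bucket_name, record_value):
    """Append one incoming record; the caller decides when to commit."""
    conn.execute(
        "INSERT INTO records (bucket_name, record_value) VALUES (?, ?)",
        (bucket_name, record_value),
    )

def fetch_bucket(conn, bucket_name):
    """Return one bucket's records in insertion order, using the index."""
    rows = conn.execute(
        "SELECT record_value FROM records WHERE bucket_name = ? ORDER BY rowid",
        (bucket_name,),
    )
    return [value for (value,) in rows]
```

New bucket names need no setup at all: a bucket exists as soon as a record with that name is stored, and the whole thing uses a single database handle no matter how many buckets turn up.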

Either way, I'd avoid having thousands of file handles open at the same time. There must be some good reason why every OS sets a default limit on the number of open file handles per process, and exceeding that limit by orders of magnitude would, I expect, lead to trouble.


In reply to Re: maximum number of open files by graff
in thread maximum number of open files by Fisch
