For splitting up the contents of a large, complicated input file, I'd do one pass to build an index of the records to be sorted: the output of this pass is a stream of lines containing "bucket_name start_offset byte_length" for each distinct input record. Then I'd sort that index by bucket_name, and a second-pass script would do a seek(...) and read(...) on the big file for each line of the sorted index. Because of the sorting, all the records intended for a given bucket end up clustered together, so only one output file needs to be open at a time. On the whole, this is likely to run a lot faster than the alternatives, because there is less file-I/O and system-call overhead.
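Roughly, the two passes might look like the sketch below (untested; the input file name, the one-record-per-line assumption, and taking the bucket name from the first whitespace-delimited field are all placeholders for whatever the real record format is):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pass 1: write "bucket_name start_offset byte_length" for each record.
    # Assumes one record per line; length() gives the byte count here
    # because the file is read without an encoding layer.
    open my $in,  '<', 'big_input.dat' or die "big_input.dat: $!";
    open my $idx, '>', 'index.txt'     or die "index.txt: $!";
    my $offset = tell $in;
    while ( my $line = <$in> ) {
        my ($bucket) = split ' ', $line;
        printf $idx "%s %d %d\n", $bucket, $offset, length $line;
        $offset = tell $in;
    }
    close $idx;

    # Sort the index so records for the same bucket are adjacent
    # (an external sort(1) handles a huge index just fine).
    system( 'sort', '-o', 'index.sorted', 'index.txt' ) == 0
        or die "sort failed: $?";

    # Pass 2: seek/read each record and append it to its bucket's file,
    # keeping only one output handle open at any moment.
    open my $sorted, '<', 'index.sorted' or die "index.sorted: $!";
    my ( $current_bucket, $out ) = ( '', undef );
    while (<$sorted>) {
        my ( $bucket, $start, $len ) = split;
        if ( $bucket ne $current_bucket ) {
            close $out if $out;
            open $out, '>>', "$bucket.out" or die "$bucket.out: $!";
            $current_bucket = $bucket;
        }
        seek $in, $start, 0 or die "seek: $!";
        read $in, my $record, $len;
        print $out $record;
    }
    close $out if $out;
    close $sorted;
    close $in;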
If dealing with a continuous input stream, where two passes over the data might not be practical (and the number/names of potential output buckets might not be known in advance), I'd probably switch to storing the records in a database instead of in lots of separate files -- a mysql/oracle/whatever flat table with fields "bucket_name" and "record_value" might suffice, if you build an index on the bucket_name field to speed up retrieval by bucket.
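As a sketch of that database variant (assuming DBI with MySQL; the database name, credentials, table and column definitions are all made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=bucketdb', 'user', 'password',
        { RaiseError => 1, AutoCommit => 0 } );

    # One flat table; the index on bucket_name is what makes later
    # per-bucket retrieval fast.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS records (
            bucket_name  VARCHAR(64) NOT NULL,
            record_value TEXT        NOT NULL,
            INDEX (bucket_name)
        )
    });

    my $insert = $dbh->prepare(
        'INSERT INTO records (bucket_name, record_value) VALUES (?, ?)');

    # Consume the stream as it arrives; commit in batches to keep up.
    my $n = 0;
    while ( my $line = <STDIN> ) {
        chomp $line;
        my ( $bucket, $value ) = split ' ', $line, 2;
        $insert->execute( $bucket, $value );
        $dbh->commit unless ++$n % 1000;
    }
    $dbh->commit;

    # Later, pull out one bucket at a time with a single indexed query:
    #   SELECT record_value FROM records WHERE bucket_name = ?
    $dbh->disconnect;

With the index in place, writing each bucket back out is one indexed SELECT per bucket, and no pile-up of open file handles ever happens.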
Either way, I'd avoid having thousands of file handles open at the same time. There must be a good reason why every OS imposes a standard/default limit on the number of open file handles per process, and exceeding that limit by orders of magnitude would, I expect, lead to trouble.