> Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB.
Yep. I get data sets greater than 25 GB all the time.
> You are writing that much data to the file system in the first place.
The bottleneck (of course) comes from writing the data to an old-fashioned hard disk - things would obviously proceed much more quickly if I had 64 GB of RAM and could write everything to a RAM disk :) - so I never considered the possibility that writing 7,000 files slowed things down much.
> Your app is much bigger than I thought.
I wouldn't describe the app as large at all (obviously the sets of data weigh in heavily). :) Beyond the loops, reading in a one-line second file and a small (less than 1 MB) third file, testing for EOF on the third file, and setting a variable based on changing values in the third file, the remainder (and only significant part) of the script consists of a one-line system call. The entire script doesn't exceed 1,500 bytes, commented and not obfuscated in any way.
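For what it's worth, the overall shape is roughly this (a hand-waved Perl sketch: the file names, the command being called, and its arguments are placeholders, and the loop over the actual data files is elided):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One-line second file: grab the single value it holds.
    open my $fh2, '<', 'second_file.txt' or die "second_file.txt: $!";
    chomp( my $param = <$fh2> );
    close $fh2;

    # Small (<1 MB) third file: read until EOF, updating a variable
    # whenever the value on a line changes.
    open my $fh3, '<', 'third_file.txt' or die "third_file.txt: $!";
    my $current = '';
    while ( my $line = <$fh3> ) {            # loop ends at EOF
        chomp $line;
        $current = $line if $line ne $current;

        # The only significant work: a one-line system call.
        system( 'some_command', $param, $current ) == 0
            or warn "some_command failed: $?";
    }
    close $fh3;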
> But a DB can handle that, even SQLite for 64 bit processor.
Again, the external programs that do the subsequent processing don't work with databases, but I will definitely keep the technique in mind.
> The processing that you do with this data is unclear as well as the size of the result set.
I have kept that rather vague (nothing illegal, but the possibility exists of a "terms of use" violation somewhere along the way). :) The final result ends up only slightly smaller than the input files (after reformatting and some selective discarding).
That said, I have come up with a few ideas for the "upstream" script which would completely eliminate the need for this one. It would replace the naming system with a different one that maintains sort order, and it could combine some of the files at acquisition time (doing there the processing that this script does now), which would reduce the number of files (but not the total size of the data) by about an order of magnitude.
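As a rough illustration of the naming idea (hypothetical names and batch size, not the actual scheme): zero-padded sequence numbers sort lexically in the same order they're generated, and batching several acquisitions into one file cuts the file count:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Zero-padded sequence numbers keep plain lexical sort order equal to
    # acquisition order (hypothetical names, not the real ones).
    my $seq = 0;
    sub next_name { sprintf 'chunk_%06d.dat', $seq++ }   # chunk_000000.dat, chunk_000001.dat, ...

    # Writing, say, ten acquisitions per file at capture time would cut the
    # file count by about an order of magnitude without shrinking the data.
    my @batch = map { "record $_\n" } 1 .. 10;   # stand-in for acquired data
    open my $out, '>', next_name() or die "open: $!";
    print {$out} @batch;
    close $out;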
Thanks again for your help and suggestions.