in reply to read and sort multiple files

Pre-sort each individual file (so you never have more than one in memory at a time), then merge the sorted files together. If a file is too big to be sorted in memory, split it into smaller files first.

This is merge sort.
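
A minimal Perl sketch of that two-pass scheme (the file naming, error handling, and merge loop are illustrative, not from the original post; it assumes newline-terminated lines): pass 1 sorts each file in memory on its own; pass 2 merges the sorted files by always emitting the smallest current line.

#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: sort each file individually, one file in memory at a time.
my @sorted;
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't read $file: $!";
    my @lines = sort <$in>;
    close $in;
    open my $out, '>', "$file.sorted" or die "Can't write $file.sorted: $!";
    print $out @lines;
    close $out;
    push @sorted, "$file.sorted";
}

# Pass 2: N-way merge of the pre-sorted files.
my @fh = map { open my $fh, '<', $_ or die "Can't read $_: $!"; $fh } @sorted;
my @head = map { scalar readline $_ } @fh;   # current line of each file
while (grep { defined } @head) {
    # Pick the file whose current line sorts first.
    my ($min) = sort { $head[$a] cmp $head[$b] }
                grep { defined $head[$_] } 0 .. $#head;
    print $head[$min];
    $head[$min] = readline $fh[$min];        # advance that file
}

The linear scan for the smallest line is fine for a handful of files; with many files, a heap would do that selection in logarithmic time.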

Re^2: read and sort multiple files
by matrixmadhan (Beadle) on Dec 01, 2008 at 06:46 UTC
    Instead of using merge sort alone (as a plain approach), a hybrid approach that combines in-memory sorting with merge sort can be used.

    For example:

    Out of the total of n files, sort only m files in memory at a time.

    (m might be roughly n/2; this is only an approximation, and it can be increased or decreased based on the memory available and the memory threshold the process in question is permitted to use.)



    With the approach of combining merge sort and in-memory sorting:

    1) Both merge sort and the speed of in-memory sorting are put to use.

    2) There is no problem of too many files taking up too much memory, since the number of files sorted in memory at once is now controlled.
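
    A rough Perl sketch of that grouping step (the batch size $m and the run file names are illustrative): each group of m files is loaded and sorted in memory as one unit, producing one sorted run per group; the runs are then merged as in the parent reply.

    use strict;
    use warnings;
    use List::Util qw(min);

    my $m = 4;    # files per in-memory batch; tune to available memory
    my @files = @ARGV;
    my @runs;
    for (my $i = 0; $i < @files; $i += $m) {
        my @group = @files[$i .. min($i + $m - 1, $#files)];
        # Load the whole group and sort it in memory as one unit.
        my @lines;
        for my $file (@group) {
            open my $in, '<', $file or die "Can't read $file: $!";
            push @lines, <$in>;
            close $in;
        }
        my $run = 'run' . scalar(@runs) . '.sorted';
        open my $out, '>', $run or die "Can't write $run: $!";
        print {$out} sort @lines;
        close $out;
        push @runs, $run;
    }
    # @runs now holds ceil(n/m) sorted files, ready for the merge phase.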

      Instead of using merge sort alone (as a plain approach), a hybrid approach that combines in-memory sorting with merge sort can be used.

      That's what the post to which you replied already suggested.

      Out of the total of n files, sort only m files in memory at a time.

      A 100MB file takes up a pretty major chunk of memory already. Remember, if the array isn't preallocated to hold enough lines, twice the size of the data is needed. [struck out; see update below]

      If I were to re-implement the work in Perl, I'd probably do something equivalent to

      1. cat * | (cd tmp; split --lines=XXX --bytes=YYY - chunk)
        This maximizes memory utilization while limiting peak memory usage.
      2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done
        The sorting would actually be done before writing out the chunk.
      3. Merge file pairs until only one file remains (sketched below).
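
      A sketch of step 3 in Perl (the subroutine name and file handling are my own, not from the post; it assumes newline-terminated lines): merge two sorted files line by line, holding only one line from each in memory, and repeat over pairs until a single file remains.

      use strict;
      use warnings;

      # Merge two sorted files into a third; only one line from each
      # file is in memory at any moment, so memory use stays constant
      # no matter how big the files are.
      sub merge_pair {
          my ($file_a, $file_b, $file_out) = @_;
          open my $fa,  '<', $file_a   or die "Can't read $file_a: $!";
          open my $fb,  '<', $file_b   or die "Can't read $file_b: $!";
          open my $out, '>', $file_out or die "Can't write $file_out: $!";
          my $la = <$fa>;
          my $lb = <$fb>;
          while (defined $la && defined $lb) {
              if ($la le $lb) { print $out $la; $la = <$fa>; }
              else            { print $out $lb; $lb = <$fb>; }
          }
          # Drain whichever file still has lines left.
          while (defined $la) { print $out $la; $la = <$fa>; }
          while (defined $lb) { print $out $lb; $lb = <$fb>; }
          close $_ for $fa, $fb, $out;
      }

      Each round of merge_pair calls halves the number of chunk files, so n chunks need about log2(n) passes over the data.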

      Update: I struck out a statement that's probably wrong. There is overhead, but it should be proportional to the number of lines, not the number of bytes.

        This maximizes memory utilization while limiting peak memory usage.

        Sorry, I don't understand the above comment at all. Would you mind explaining that?

        2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done

        The sorting would actually be done before writing out the chunk.

        This cannot be guaranteed; what if the file is too big to sort in memory?