in reply to Re^5: Working on huge (GB sized) files
in thread Working on huge (GB sized) files
And then both sorted files have to be read, merged, and written to the final merged file. The data has to be read twice and written twice; the merge into final form happens during the second write. Now of course, if the data set is so humongous that multiple merge passes are required (too many temp files to merge at one time), that's going to take more read/write cycles! It depends on how humongous the data set is. My assumption is that all temp files can be merged in one operation: X records from each file are queued in memory, and you repeatedly select the lowest record off the top of all the queues. When one queue runs dry, recharge it with another 100MB of data, or however much fits in each queue, and so on. The key here is that issuing one read for a whole batch of records is way faster than many trips through read. Of course, the problem is that the last record in a batch may not be complete, and the code has to deal with that. The OS read cache is most likely counterproductive here; what I'd want is memory-mapped I/O straight from the disk. A sketch of this merge loop follows below.
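Here is a minimal Perl sketch of that k-way merge, assuming newline-terminated records that compare lexically. The file names, block size, and output path are made up for illustration; it also assumes every block contains at least one complete record, which holds trivially for 100MB blocks and KB-sized records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @files = map { "temp$_.dat" } 1 .. 5;   # hypothetical temp files
my $BLOCK = 100 * 1024 * 1024;             # ~100MB per refill
my ( @fh, @queue, @partial );

for my $i ( 0 .. $#files ) {
    open $fh[$i], '<', $files[$i] or die "open $files[$i]: $!";
    $queue[$i]   = [];    # in-memory queue of complete records
    $partial[$i] = '';    # carry-over for an incomplete last record
    refill($i);
}

open my $out, '>', 'merged.dat' or die "open merged.dat: $!";

while (1) {
    # Select the lowest record off the top of all the queues.
    my $min;
    for my $i ( 0 .. $#queue ) {
        next unless @{ $queue[$i] };
        $min = $i
            if !defined $min or $queue[$i][0] lt $queue[$min][0];
    }
    last unless defined $min;              # every queue has run dry
    print {$out} shift @{ $queue[$min] };
    refill($min) unless @{ $queue[$min] }; # recharge an empty queue
}

# Read one big block, split it into records, and keep the trailing
# incomplete record (if any) for the next refill.
sub refill {
    my ($i) = @_;
    my $n = read $fh[$i], my $buf, $BLOCK;
    if ( !$n ) {    # EOF: flush any leftover tail as the final record
        push @{ $queue[$i] }, $partial[$i] if length $partial[$i];
        $partial[$i] = '';
        return;
    }
    $buf = $partial[$i] . $buf;
    my @recs = split /(?<=\n)/, $buf;      # keep newlines on records
    $partial[$i] = $recs[-1] =~ /\n\z/ ? '' : pop @recs;
    push @{ $queue[$i] }, @recs;
}
```

With only a handful of queues, a linear scan for the minimum is fine; with dozens of temp files you'd swap it for a heap.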
Say the sort block is 600MB and the data is 3GB. A simple approach results in 5 temp files (600MB each). That's a small enough number that, with say 100MB merge queues and all 5 files open at once, the whole merge requires 30 block-read (100MB) operations: 5 files of 600MB each is 3GB of data, read 100MB at a time. That's what I meant by sequential blocks: the files to be merged are read not record by record but as a series of records, what I called a "block". The number of records per block is unknown; it depends on the data. Reading the next, say, 100MB from a file is far more efficient than seeking around trying to pick off 1KB pieces at random places. A trivial helper that codifies this arithmetic follows below.
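To make the arithmetic concrete (the numbers here are the illustrative ones above, not anything known about the OP's application):

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Given data size, in-memory sort block, and merge-queue size (all in
# MB), how many temp files does the sort phase produce, and how many
# block reads does a single merge pass cost?
sub merge_cost {
    my ( $data, $sort_block, $queue ) = @_;
    return ( ceil( $data / $sort_block ),    # temp files
             ceil( $data / $queue ) );       # block reads per pass
}

my ( $files, $reads ) = merge_cost( 3000, 600, 100 );
print "$files temp files, $reads block reads per merge pass\n";
# prints: 5 temp files, 30 block reads per merge pass
```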
Of course all of these parameters matter. If things are huge and we get into, say, 50 temporary files, things are going to slow way down because more merge passes are required: with 50 temp files and, say, a 10-way merge, the first pass leaves 5 intermediate files that must themselves be merged, so the data gets read and written an extra time. Neither of us knows what any of these numbers really are in the OP's application.
And all of this depends on how smart (or not) the system sort on the OP's machine is. I wouldn't recommend that the OP recode it. For all I know the OP has a .csv file and just needs to type the proper one-liner to sort his data, something like the sketch below.
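For instance, if it really is a plain .csv, GNU sort already implements exactly this sort-then-merge scheme internally. A hypothetical invocation (the field number, buffer size, and file names are made up) might look like:

```
sort -t, -k2,2 -S 600M -T /tmp -o sorted.csv data.csv
```

Here `-t,` sets the field separator, `-k2,2` sorts on the second field only, `-S 600M` sets the in-memory sort buffer, and `-T /tmp` chooses where the temp files land.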
I think we've aired the major issues in this discussion.
Replies are listed 'Best First'.
Re^7: Working on huge (GB sized) files by BrowserUk (Patriarch) on May 13, 2011 at 23:18 UTC