in reply to Re^5: Working on huge (GB sized) files
in thread Working on huge (GB sized) files

Apparently we have different assumptions about the number of temporary files needed, the size of those files, and the number of records that can be read from a file in one block.
And then both sorted files have to be read, merged and written to the final merged file.
The data has to be read twice and written twice. The merging into the final form happens during the 2nd write. Now of course, if this data set is so humongous that multiple merge passes are required (too many files to do it at one time), then that's going to take more RW cycles! It depends upon how humongous the data set is. My assumption is that all temp files can be merged in one operation: X records from each file are queued in memory; just select the lowest record off the top of all the queues, repeat until one queue runs dry, recharge it with another 100MB of data (or however much fits in each queue), and so on.

The key here is that issuing one read for a bunch of records is way faster than many trips through read. Of course the problem here is that the last record in a block may not be complete, and the code has to deal with that. A read cache is most likely counterproductive; what I'd want is memory-mapped I/O straight from the disk.
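A rough Perl sketch of that queued merge, purely to illustrate the idea (the temp file names, the ~100MB queue size, and newline-terminated records compared as plain strings are all my assumptions, not the OP's data):

    use strict;
    use warnings;

    # Already-sorted temp files to merge. Names and sizes are made up.
    my @files       = glob 'temp*.srt';
    my $QUEUE_BYTES = 100 * 1024 * 1024;

    my ( @fh, @queue, @carry );
    for my $i ( 0 .. $#files ) {
        open my $fh, '<', $files[$i] or die "$files[$i]: $!";
        ( $fh[$i], $queue[$i], $carry[$i] ) = ( $fh, [], '' );
        refill($i);
    }

    open my $out, '>', 'merged.dat' or die "merged.dat: $!";
    while (1) {
        my $low;                                   # queue whose head record sorts lowest
        for my $i ( 0 .. $#files ) {
            next unless @{ $queue[$i] };
            $low = $i if !defined $low or $queue[$i][0] lt $queue[$low][0];
        }
        last unless defined $low;                  # every queue (and file) has run dry
        print {$out} shift @{ $queue[$low] };
        refill($low) unless @{ $queue[$low] };     # recharge with the next block
    }

    sub refill {
        my $i   = shift;
        my $fh  = $fh[$i];
        my $buf = $carry[$i];
        read $fh, $buf, $QUEUE_BYTES, length $buf; # one big read, not one read per record
        my @recs = split /(?<=\n)/, $buf;
        $carry[$i] = '';
        $carry[$i] = pop @recs                     # hold back the incomplete last record
            if @recs and $recs[-1] !~ /\n\z/ and not eof $fh;
        push @{ $queue[$i] }, @recs;
    }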

Say the size of a sort block is 600MB and the size of the data is 3GB. A simple approach results in 5 temp files (600MB each). That's a small enough number that all 5 files can be open at once; with, say, 100MB merge queues, the whole merge requires 30 block-read (100MB) operations. That's what I meant by sequential blocks - the files to be merged would be read not record by record, but as a series of records, what I called a "block". The number of records in a block is unknown; it depends upon the data. Reading the next, say, 100MB of data from a file is more efficient than seeking around trying to pick off 1KB pieces at random places.
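The back-of-the-envelope arithmetic, using those illustrative numbers (all of them assumptions about the OP's data):

    use POSIX qw(ceil);

    my ( $data_mb, $sort_block_mb, $queue_mb ) = ( 3_000, 600, 100 );
    my $temp_files  = ceil( $data_mb / $sort_block_mb );  # 5 temp files of 600MB each
    my $block_reads = ceil( $data_mb / $queue_mb );       # 30 reads of 100MB to merge them all
    print "$temp_files temp files, $block_reads block reads for the merge\n";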

Of course all of these parameters matter. If things are huge and we get into, say, 50 temporary files, things are going to slow way down because more merge passes are required. Neither of us knows what any of these numbers really are in the OP's application.

And all of this has to do with how smart (or not) the sort on the OP's system is. I wouldn't recommend that the OP recode it. For all I know, the OP has a .csv file and just needs to type the proper one-liner to sort his data.

I think we've got the major issues out in the discussion.

Re^7: Working on huge (GB sized) files
by BrowserUk (Patriarch) on May 13, 2011 at 23:18 UTC
    The data has to be read twice and written twice. The merging into the final form happens during the 2nd write

    I'm sorry, but yet again, no.

    The only reason to sort is if the dataset is too large to fit in memory. Otherwise there is absolutely no good reason to use an O(N logN) sort algorithm when an O(N) hash algorithm does the job.
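    By way of illustration only, assuming the job is something like pairing up records from two files on a shared, tab-separated key field (which may or may not be what the OP actually needs), the hash approach is one pass over each file and no sort anywhere:

        use strict;
        use warnings;

        # Assumed layout: tab-separated records, key in the first field.
        my %byKey;
        open my $fhA, '<', 'fileA.dat' or die $!;
        while ( my $rec = <$fhA> ) {
            my ($key) = split /\t/, $rec, 2;
            $byKey{$key} = $rec;                  # whole record from file A, keyed
        }

        open my $fhB, '<', 'fileB.dat'    or die $!;
        open my $out, '>', 'combined.dat' or die $!;
        while ( my $rec = <$fhB> ) {
            my ($key) = split /\t/, $rec, 2;
            next unless exists $byKey{$key};      # unmatched keys: handle as required
            chomp( my $recA = $byKey{$key} );
            print {$out} $recA, "\t", $rec;       # matched pair written straight out
        }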

    Only once the dataset has grown so large that it is impossible to hold all the records in memory at one time does the sort&merge algorithm have any merit whatsoever.

    And if you cannot hold the whole dataset in memory then you cannot sort it in memory in a single operation. So, you use a disk sort that reads a subset of the data, and writes the sorted subset to a temporary file. Then you read another subset into memory and sort it and write the results to a temporary file. And so on until you have S sorted subsets in temporary files. Now you need to merge those subsets together.
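    A bare-bones sketch of that first phase (the chunk size, file names and line-oriented records are assumptions for illustration):

        use strict;
        use warnings;

        # Read as much as fits in the memory budget, sort it, write a sorted run,
        # repeat until the input is exhausted. Sizes and names are made up.
        my $CHUNK_BYTES = 500 * 1024 * 1024;
        open my $in, '<', 'huge_input.dat' or die $!;

        my ( $run, $bytes, @records ) = ( 0, 0 );
        while ( my $rec = <$in> ) {
            push @records, $rec;
            $bytes += length $rec;
            if ( $bytes >= $CHUNK_BYTES or eof $in ) {
                write_run( $run++, \@records );
                ( $bytes, @records ) = (0);
            }
        }

        sub write_run {
            my ( $n, $recs ) = @_;
            open my $out, '>', sprintf( 'temp%03d.srt', $n ) or die $!;
            print {$out} sort @$recs;             # in-memory sort of one subset
            close $out or die $!;
        }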

    Read and sort, and write to temp; read from temp to merge and write to sorted. That produces one sorted file. 2N reads; 2N writes; one sorted file.

    Now you need to repeat that for the second file. 4N reads; 4N writes; two sorted files.

    Now you need to read both sorted files, merge them and write the final merged output.

    That's 8N reads; and 8N writes; one resultant merged file.

    Total 16N IOPs. Compared to 6N IOPs for the hash&memory algorithm.

    And if, after 3 attempts of my trying to explain this to you, you still cannot see it, please keep it to yourself, because you are simply wrong. Sorry.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.