in reply to Re^3: Working on huge (GB sized) files
in thread Working on huge (GB sized) files

Good discussion. I agree there is certainly some guesswork involved in this.

Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it and writing it to a temp file; read another chunk, sort and write to temp.
Yes, agree.
Whilst the reading and writing of the chunks are sequential, they are interleaved, and that alone has a substantial performance impact over reading the whole sequentially and then writing the whole sequentially.
I am not so sure about this. I guess that what you mean by interleaved is that the program switches back and forth between read and write. Typically the huge input file would be read sequentially, but in big hunks (maybe thousands of records at a time). Sort. Then write the big hunk to a new file. So the data gets read once and written once.
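
For concreteness, here is a minimal sketch of that chunk-sort-write phase in Perl. The file name, chunk size, and plain string sort are all assumptions for illustration, not anything from the OP's data:

    use strict;
    use warnings;

    # Chunk phase of a disk-based sort: read a big hunk, sort it in
    # memory, write it out as a sorted temporary file, repeat.
    my $CHUNK = 100_000;    # records per in-memory hunk (assumed)
    open my $in, '<', 'huge_input.dat' or die "open: $!";

    my (@temp_files, @buf);
    while (my $rec = <$in>) {
        push @buf, $rec;
        if (@buf == $CHUNK || eof $in) {
            my $tmp = sprintf 'chunk_%03d.tmp', scalar @temp_files;
            open my $out, '>', $tmp or die "open $tmp: $!";
            print {$out} sort @buf;    # one sequential write per sorted hunk
            close $out;
            push @temp_files, $tmp;
            @buf = ();
        }
    }
    close $in;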

In the create-hash-table scenario, the data only has to be read once - there is no write. Presumably the data is also read in big hunks, say the same size hunk as above.
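
And a correspondingly minimal sketch of that single read pass that builds the hash, again with an assumed file name and an assumed first tab-separated field as the key:

    use strict;
    use warnings;

    # Hash-build pass: one sequential read of the file, no writes.
    my %by_key;
    open my $in, '<', 'file_a.dat' or die "open: $!";
    while (my $rec = <$in>) {
        chomp $rec;
        my ($key) = split /\t/, $rec, 2;    # assumed key field
        $by_key{$key} = $rec;
    }
    close $in;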

The system will certainly allow other I/O processing right after the "big block" request has been satisfied. The system is probably also interleaving other disk I/O requests while the "big block" operation is proceeding. That's true in either scenario above. So when the 2nd read needs to happen, in either case, the disk head is most likely somewhere else and not sitting right at the next block to be read. I see a 2x difference, which is "substantial". I don't know whether you meant that it was a lot more than that or not.

There are of course all kinds of things that can be happening cache-wise. It's not clear how much, if any, of the write data will be kept in cache once it is flushed to disk, since the file presumably being opened write-only is a big clue to the system. In both scenarios there could also be read-ahead happening.

I think the big difference is that in the hash table scenario, each disk operation is going to be getting a random record, maybe 2 different records from different parts of the file. Whereas the merge is going to be getting thousands of sequential records, and does this only when one of its n queues runs dry. So it looks to me like the hash approach has the potential to generate many more disk operations than the sort/merge - enough to swamp this initial 2x I/O difference.
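
One way to picture the random-access pattern I mean: if only keys and byte offsets were held in memory and whole records were pulled from disk on demand, each lookup could become a seek plus a small read. This is purely illustrative; the fixed record length and file name are invented:

    use strict;
    use warnings;

    # Random-access fetch of a single record by stored byte offset.
    my $RECLEN = 1024;    # assumed fixed-length records
    open my $fh, '<', 'huge_input.dat' or die "open: $!";

    sub fetch_record {
        my ($recno) = @_;
        seek $fh, $recno * $RECLEN, 0 or die "seek: $!";
        read $fh, my $rec, $RECLEN;    # potentially one disk op per record
        return $rec;
    }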

How all of this is going to work out is very dependent upon what system it is running on, how the caching is set up, what other processes are doing at the same time, and how much of the total system resources this collation process is able to use... I think this is going to come down to a YMMV situation. And although the sort is an N log N operation versus an N operation for scanning and building the hash, the whole job is going to be dominated by disk I/O time. The CPU operations won't matter, even if there are a heck of a lot more of them, because they are so fast compared with the disk. I suspect the hash approach uses less actual CPU time, but takes more "wall clock" time.

Re^5: Working on huge (GB sized) files
by BrowserUk (Patriarch) on May 13, 2011 at 20:14 UTC
    So data gets read once and written once.

    No. You forgot that the sort needs to merge back the temporary files into the sorted output file.

    So each record is read once, sorted, and written once to a temporary file. It is then read again from that temporary file during the merge and written to the sorted output file. So that's two reads and two writes.

    And the same has to be applied to the second file.

    And then both sorted files have to be read, merged and written to the final merged file.

    I think the big difference is that in the hash table scenario, each disk operation is going to be getting a random record, maybe 2 different records from different parts of the file. Whereas, the merge is going to be getting thousands of sequential records and does this when one of its n queues runs dry.

    Again, no. You seem to be assuming that the merge consists of: reading temp file 1 & writing it to the merged file; reading temp file 2 & writing it to the merged file; and so on. It doesn't.

    It requires reading the first record (or block) from each of the temporary files; comparing the records from all the files against each other, selecting the lowest, and writing it; then reading the next record from that file (or its cache); and repeating until the end of all the files.

    There are no "thousands of sequential records". There may be occasional runs of sequential records from a given file, but each of those records needs to be compared against the next records from each of the other files before it is known whether it is sequentially the next record to be written or not. And sometimes there will be runs where no two sequential records in the output file come from the same input file.
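
    A minimal sketch of that comparison loop, assuming line-oriented records that compare as plain strings and hypothetical temp-file names (a simple linear scan picks the lowest head record; with many files you'd use a heap):

        use strict;
        use warnings;

        # k-way merge of sorted temp files: always write the lowest of
        # the current head records, then refill only from that file.
        my @files = map { sprintf 'chunk_%03d.tmp', $_ } 0 .. 4;
        my @fhs   = map { open my $fh, '<', $_ or die "open $_: $!"; $fh } @files;
        my @head  = map { scalar readline $_ } @fhs;

        open my $out, '>', 'sorted_output.dat' or die "open: $!";
        while (grep { defined } @head) {
            my $min;
            for my $i (0 .. $#head) {
                next unless defined $head[$i];
                $min = $i if !defined $min || $head[$i] lt $head[$min];
            }
            print {$out} $head[$min];
            $head[$min] = readline $fhs[$min];    # next record from that file only
        }
        close $out;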

    It is exactly this entirely unpredictable, entirely random access pattern, which so badly defeats caching heuristics, combined with nearly three times as many IOPs, that makes sort&merge at least an order of magnitude slower than hash&memory, all else being equal.

    Please note: that isn't based on theoretical guesswork, but on actual tests. And not just once. And not just recently. But many times, going back many years.

    Almost exactly 30 years ago, we sped up the merging of 5 sets of 6 million records, using twin PDP-11/60s and (for the time, huge) "ganged" RK07 28MB disk packs, by 11 times using exactly this technique. Because the data was bigger than the combined disks we could have on line, we merged the key-fields and record offsets in memory, loading each of the input disk packs once in turn and then the output pack once, rather than having to constantly swap input packs with temporary packs, then the input packs again, then the output pack, as with the previous sort-all-5-then-merge process.

    And I went through the same comparative process using real files, disks and data just a couple of months ago, in response to a post right here on this site. And a couple of years before that on my previous machine.

    And it is neither theory nor speculation when I tell you that the differential has never been below an order of magnitude; and because I/O speeds have stagnated, whilst CPU speeds have risen exponentially (until recently), and RAM sizes have grown and continue to grow exponentially, the performance gain of hash&memory over sort&merge is increasing.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Apparently we have different assumptions about the number of temporary files needed, the size of those files, and the number of records which can be read from a file in one block.
      And then both sorted files have to be read, merged and written to the final merged file.
      The data has to be read twice and written twice. The merging into the final form happens during the 2nd write. Now of course, if this data set is so humongous that multiple merge passes are required (too many files to do it at one time), then that's going to take more R/W cycles! It depends upon how humongous the data set is. My assumption is that all temp files can be merged in one operation. X records from each file are queued in memory - just select the lowest record off the top of all of them. Repeat until one queue runs dry; recharge with another 100MB of data, or however much fits in each queue, etc... The key here is that issuing one read for a bunch of records is way faster than many trips through read. Of course the problem here is that the last record may not be complete, and the code has to deal with that (see the sketch below). A read cache is most likely counterproductive, and what I'd want is memory-mapped I/O straight from the disk.
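
      To make that concrete, here is a rough sketch of such block-buffered queues, including the awkward partial-last-record handling mentioned above. The block size, file names, and plain string comparison are all assumptions:

          use strict;
          use warnings;

          # Block-buffered merge queues: refill a queue with one big read
          # when it runs dry, holding back any partial trailing record.
          my $BLOCK = 100 * 1024 * 1024;    # 100MB per refill (assumed)
          my @files = map { sprintf 'chunk_%03d.tmp', $_ } 0 .. 4;

          my @q;
          for my $f (@files) {
              open my $fh, '<', $f or die "open $f: $!";
              push @q, { fh => $fh, recs => [], tail => '' };
          }

          sub refill {
              my ($s) = @_;
              until (@{ $s->{recs} }) {
                  my $fh = $s->{fh};
                  my $n  = read $fh, my $buf, $BLOCK;
                  if (!$n) {    # EOF: flush any held-back partial record
                      push @{ $s->{recs} }, $s->{tail} if length $s->{tail};
                      $s->{tail} = '';
                      last;
                  }
                  $buf = $s->{tail} . $buf;
                  my @recs = split /(?<=\n)/, $buf;    # keep newlines attached
                  $s->{tail} = $recs[-1] =~ /\n\z/ ? '' : pop @recs;
                  push @{ $s->{recs} }, @recs;
              }
          }

          open my $out, '>', 'merged_output.dat' or die "open: $!";
          while (1) {
              my $min;
              for my $i (0 .. $#q) {
                  refill($q[$i]) unless @{ $q[$i]{recs} };    # recharge when dry
                  next unless @{ $q[$i]{recs} };
                  $min = $i
                      if !defined $min || $q[$i]{recs}[0] lt $q[$min]{recs}[0];
              }
              last unless defined $min;
              print {$out} shift @{ $q[$min]{recs} };
          }
          close $out;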

      Say the size of the sort block is 600MB and the size of the data is 3GB. A simple approach results in 5 temp files (600MB each). That's a small enough number that, with say 100MB merge queues and all 5 files open at once, the merge requires 30 block-read (100MB) operations. That's what I meant by sequential blocks - the files to be merged would be read not record by record but as a series of records, what I called a "block". The number of records in a block is unknown; it depends upon the data. Reading the next, say, 100MB of data from a file is more efficient than seeking around trying to pick off 1KB pieces at random places.
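
      Spelled out, the arithmetic behind those 30 operations (using only the sizes assumed in this example):

          3GB of data / 600MB per sort block           = 5 temp files
          600MB per temp file / 100MB per queue refill = 6 block reads per file
          5 files x 6 block reads                      = 30 block-read operations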

      Of course all of these parameters matter. If things are huge and we get into, say, 50 temporary files, things are going to slow way down because more merge passes are required. Neither of us knows what any of these numbers really are in the OP's application.

      And all of this has to do with how smart or not the system sort is on the OP's system. I wouldn't recommend that the OP recode it. For all I know the OP has a .csv file and just needs to type in the proper one-liner to sort his data.

      I think we've got the major issues out in the discussion.

        The data has to be read twice and written twice. The merging into the final form happens during the 2nd write

        I'm sorry, but yet again, no.

        The only reason to sort is if the dataset is too large to fit in memory. Otherwise there is absolutely no good reason to use an O(N log N) sort algorithm when an O(N) hash algorithm does the job.

        Only once the dataset has grown so large that it is impossible to hold all the records in memory at one time does the sort&merge algorithm have any merit whatsoever.

        And if you cannot hold the whole dataset in memory, then you cannot sort it in memory in a single operation. So you use a disk sort that reads a subset of the data, sorts it, and writes the sorted subset to a temporary file. Then you read another subset into memory, sort it, and write the results to a temporary file. And so on, until you have S sorted subsets in temporary files. Now you need to merge those subsets together.

        Read, sort, and write to temp; read from temp to merge and write to sorted. And that produces one sorted file. 2N reads; 2N writes; one sorted file.

        Now you need to repeat that for the second file. 4N reads; 4N writes; two sorted files.

        Now you need to read both sorted files, merge them and write the final merged output.

        That's 8N reads and 8N writes; one resultant merged file.

        Total 16N IOPs. Compared to 6N IOPs for the hash&memory algorithm.

        And if, after 3 attempts of my trying to explain this to you, you still cannot see it, please keep it to yourself, because you are simply wrong. Sorry.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.