in reply to sorting type question- space problems

The best approach is to read records from the file until you hit a size threshold (whatever fits in memory, leaving headroom for the sort itself), sort those records, then write them out to a sequentially numbered file. Once the entire 20 GB has been written out this way, probably as 50-100 files, you go through the files in pairs and combine each pair with a line-by-line merge routine. Then you merge those files in pairs, and so on, until only one file remains (a sketch of both phases is below). If this has to run unattended, have a cleanup routine run some reasonable amount of time later and delete the intermediate files if more than one remains.
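Here is a minimal sketch of both phases in Perl, with hedges: record comparison is plain lexical line order, and the chunk size and the file names big.dat, run_N.tmp, and merge_N.tmp are illustrative, not from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative values, not from the original post.
my $chunk_lines = 1_000_000;    # tune to what fits in memory
my $input       = 'big.dat';    # hypothetical input file
my $merge_count = 0;            # used to name intermediate merge files

# Phase 1: read chunks that fit in memory, sort each, write numbered runs.
open my $in, '<', $input or die "Cannot open $input: $!";
my (@buffer, @runs);
my $run = 0;
while (my $line = <$in>) {
    push @buffer, $line;
    if (@buffer >= $chunk_lines) {
        push @runs, write_run(++$run, \@buffer);
        @buffer = ();
    }
}
push @runs, write_run(++$run, \@buffer) if @buffer;
close $in;
die "no records read from $input\n" unless @runs;

# Phase 2: merge the run files in pairs until only one remains.
while (@runs > 1) {
    my @next;
    push @next, shift @runs if @runs % 2;    # odd run rides along to the next round
    while (@runs) {
        my ($fa, $fb) = (shift @runs, shift @runs);
        push @next, merge_pair($fa, $fb);
        unlink $fa, $fb;                     # clean up the merged inputs
    }
    @runs = @next;
}
print "sorted output is in $runs[0]\n";

# Sort one in-memory chunk and write it to a numbered run file.
sub write_run {
    my ($n, $lines) = @_;
    my $name = "run_$n.tmp";
    open my $out, '>', $name or die "Cannot write $name: $!";
    print {$out} sort @$lines;    # plain lexical order; adapt to the real key
    close $out;
    return $name;
}

# Line-by-line merge of two sorted run files into a new file.
sub merge_pair {
    my ($fa, $fb) = @_;
    my $name = 'merge_' . ++$merge_count . '.tmp';
    open my $ha,  '<', $fa   or die "Cannot open $fa: $!";
    open my $hb,  '<', $fb   or die "Cannot open $fb: $!";
    open my $out, '>', $name or die "Cannot write $name: $!";
    my ($la, $lb) = (scalar <$ha>, scalar <$hb>);
    while (defined $la && defined $lb) {
        if ($la le $lb) { print {$out} $la; $la = <$ha>; }
        else            { print {$out} $lb; $lb = <$hb>; }
    }
    # drain whichever run still has lines left
    while (defined $la) { print {$out} $la; $la = <$ha>; }
    while (defined $lb) { print {$out} $lb; $lb = <$hb>; }
    close $_ for $ha, $hb, $out;
    return $name;
}
```

One note on the design: pairwise merging makes O(log n) passes over the data. With only 50-100 runs, a single multi-way merge that keeps the current line from each run in a small heap would finish in one pass, but the pairwise version matches the description above and is simpler to write.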

Re^2: sorting type question- space problems
by Laurent_R (Canon) on Sep 15, 2013 at 08:44 UTC

    Your approach would be great if a full sort were required. But what I was trying to say in my earlier post above is that the OP does not actually need a full sort, because there are only about 900 distinct sort keys for 5 million records. Simply splitting the data into 900 buckets (files) and then rereading those files in key order is all that is needed. In short, each record is read and written twice and its key examined only once; only the 900 keys themselves ever get sorted. This is essentially an O(n) algorithm and will be much faster than any general-purpose sorting algorithm.
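    Here is a minimal Perl sketch of that two-pass bucket approach, under assumptions that are not in the thread: records are newline-delimited, the key is the first whitespace-separated field (key_of is a hypothetical extractor), keys are safe to use in file names, and big.dat / sorted.dat are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical key extraction: assume the sort key is the first
# whitespace-separated field of each record.
sub key_of { (split ' ', $_[0], 2)[0] }

my $input = 'big.dat';    # illustrative input file name
open my $in, '<', $input or die "Cannot open $input: $!";

# Pass 1: distribute each record into a per-key bucket file.
my %bucket;               # key => open output handle
while (my $line = <$in>) {
    my $key = key_of($line);
    next unless defined $key;    # skip blank lines
    unless ($bucket{$key}) {
        open $bucket{$key}, '>', "bucket_$key.tmp"
            or die "Cannot write bucket_$key.tmp: $!";
    }
    print { $bucket{$key} } $line;
}
close $_ for values %bucket;
close $in;

# Pass 2: only the ~900 keys themselves are sorted; each bucket
# is then copied out whole, in key order.
open my $out, '>', 'sorted.dat' or die "Cannot write sorted.dat: $!";
for my $key (sort keys %bucket) {
    open my $bh, '<', "bucket_$key.tmp"
        or die "Cannot read bucket_$key.tmp: $!";
    print {$out} $_ while <$bh>;
    close $bh;
    unlink "bucket_$key.tmp";
}
close $out;
```

    One caveat: this holds one file handle per distinct key open at once (~900 here), which can bump into the per-process descriptor limit on some systems; opening each bucket in append mode per record avoids that at some cost in speed.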

      True, if there really are only about 900 distinct keys.