in reply to Re: [OT] A measure of 'sortedness'?
in thread [OT] A measure of 'sortedness'?

Q: How to sort lots of data. A: Using mergesort.

Ooh! Why didn't I think of that. Erm . Let me re-phrase: What makes you think I haven't already considered that?

That is the traditional way isn't it. (Surprised its taken you 3 days to look that up.) So what's wrong with the traditional way?

To find out, generate yourself a 100GB file of random 64-bit ints (about 15 minutes work with a one-liner) and then use your local system sort utility to sort it. Then come back to me when its finished, and we'll discuss it further.

See you in 3 or 4 days!


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re^3: [OT] A measure of 'sortedness'?
by Anonymous Monk on Mar 20, 2015 at 18:14 UTC

    If you sort the chunks while generating, then apply the n-way merge as described—that full procedure results in a 100GB write, plus another 100GB of reads and 100 GB of writes. In total, 300GB of streaming (external memory access).

    You are searching for an algorithm that does better, or claim to have found one?

      If you sort the chunks while generating,

      That presupposes that I'm generating the files.

      You are searching for an algorithm that does better

      I'm writing a utility to sort (much) larger-than-memory files of fixed-length records.

      I already have a working version -- actually several, each an improvement on the previous -- but they are pretty slow. Infinitely faster than my system sort utility, but still significant room for improvement.

      or claim to have found one?

      It is not hard to beat your local system sort utility for this purpose.

      To understand why, you really need to do as I suggested, generate a large, fixed-length record binary file, and try them for yourself.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked