in reply to Re: Can you improve upon my algorithm.
in thread Can you improve upon my algorithm.

Can't you at least use some extra RAM as a buffer?

I frequently need to sort huge (up to, and sometimes beyond, 100GB) files of fixed-length records. External sorting programs -- GNU sort or MS sort -- are very bad at this; it takes days. So finally I got around to writing my own.

Part of what makes general-purpose algorithms bad at this is the time they spend writing to and re-reading from spill files. Another part is that they need to deal with variable-length records.

My approach is to sort the files in-place on disk using memory-mapped IO. Map an (available-physical-memory-sized) chunk and sort it; map the next chunk and sort it; and so on. Then merge the sorted chunks.

By making the chunks half the size of the available physical memory, I can ensure full access to two complete chunks at any time.
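For concreteness, here is a minimal sketch of that map-a-chunk-and-sort phase, assuming the Win32 file-mapping API, plain qsort, and fixed 16-byte records whose whole content is the key. The names and constants are illustrative only, not lifted from my actual code, and error handling is omitted:

/* Sketch: sort a huge file of fixed-length records chunk by chunk, in
   place, via one memory-mapped view per chunk.  'chunk' must be a
   multiple of both RECSIZE and the allocation granularity (64KB). */
#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define RECSIZE 16u                         /* fixed record length (assumed) */

static int cmp_rec( const void *a, const void *b ) {
    return memcmp( a, b, RECSIZE );         /* whole record as the key (assumed) */
}

void sort_chunks( const char *path, unsigned long long fileSize, SIZE_T chunk ) {
    HANDLE hFile = CreateFileA( path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    HANDLE hMap  = CreateFileMappingA( hFile, NULL, PAGE_READWRITE, 0, 0, NULL );

    for( unsigned long long off = 0; off < fileSize; off += chunk ) {
        SIZE_T len  = ( fileSize - off < chunk ) ? (SIZE_T)( fileSize - off ) : chunk;
        void  *view = MapViewOfFile( hMap, FILE_MAP_WRITE,
                                     (DWORD)( off >> 32 ), (DWORD)( off & 0xFFFFFFFF ), len );
        qsort( view, len / RECSIZE, RECSIZE, cmp_rec );   /* sort this chunk in place */
        FlushViewOfFile( view, len );                     /* let the OS write it back */
        UnmapViewOfFile( view );
    }
    CloseHandle( hMap );
    CloseHandle( hFile );
}

The OS writes each chunk back as the view is flushed and unmapped, so there are no separate spill files to re-read.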

One approach to merging the chunks is to sort them together in pairs in a kind of bi-directional bubble sort.

Note: each of the letters here represents a (say) 2GB chunk of sorted memory. By sorting them pair-wise and moving left-to-right, then right-to-left (one fewer each time), eventually everything ends up in the right place. This works, but it's not very efficient:

C:\test>BiDiBubbleSort -N=8
I>H G F E D C B A : comp&swap( 0, 1 ) I < H
H I>G F E D C B A : comp&swap( 1, 2 ) I < G
H G I>F E D C B A : comp&swap( 2, 3 ) I < F
H G F I>E D C B A : comp&swap( 3, 4 ) I < E
H G F E I>D C B A : comp&swap( 4, 5 ) I < D
H G F E D I>C B A : comp&swap( 5, 6 ) I < C
H G F E D C I>B A : comp&swap( 6, 7 ) I < B
H G F E D C B I>A : comp&swap( 7, 8 ) I < A
H G F E D C B<A I : comp&swap( 6, 7 ) B < A
H G F E D C<A B I : comp&swap( 5, 6 ) C < A
H G F E D<A C B I : comp&swap( 4, 5 ) D < A
H G F E<A D C B I : comp&swap( 3, 4 ) E < A
H G F<A E D C B I : comp&swap( 2, 3 ) F < A
H G<A F E D C B I : comp&swap( 1, 2 ) G < A
H<A G F E D C B I : comp&swap( 0, 1 ) H < A
A H>G F E D C B I : comp&swap( 1, 2 ) H < G
A G H>F E D C B I : comp&swap( 2, 3 ) H < F
A G F H>E D C B I : comp&swap( 3, 4 ) H < E
A G F E H>D C B I : comp&swap( 4, 5 ) H < D
A G F E D H>C B I : comp&swap( 5, 6 ) H < C
A G F E D C H>B I : comp&swap( 6, 7 ) H < B
A G F E D C<B H I : comp&swap( 5, 6 ) C < B
A G F E D<B C H I : comp&swap( 4, 5 ) D < B
A G F E<B D C H I : comp&swap( 3, 4 ) E < B
A G F<B E D C H I : comp&swap( 2, 3 ) F < B
A G<B F E D C H I : comp&swap( 1, 2 ) G < B
A B G>F E D C H I : comp&swap( 2, 3 ) G < F
A B F G>E D C H I : comp&swap( 3, 4 ) G < E
A B F E G>D C H I : comp&swap( 4, 5 ) G < D
A B F E D G>C H I : comp&swap( 5, 6 ) G < C
A B F E D<C G H I : comp&swap( 4, 5 ) D < C
A B F E<C D G H I : comp&swap( 3, 4 ) E < C
A B F<C E D G H I : comp&swap( 2, 3 ) F < C
A B C F>E D G H I : comp&swap( 3, 4 ) F < E
A B C E F>D G H I : comp&swap( 4, 5 ) F < D
A B C E<D F G H I : comp&swap( 3, 4 ) E < D
A B C D E F G H I
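The pass order in that trace is just a bi-directional (cocktail-shaker) sweep over the chunk indices, one fewer position at each end per sweep. A sketch of the driver loop, with comp_and_swap() standing in as a placeholder for whatever pairwise sort/merge of two adjacent sorted chunks is used:

/* Reproduces the comp&swap( i, j ) order shown in the trace above.
   After each left-to-right sweep the largest remaining chunk is in its
   final place at the top; after each right-to-left sweep the smallest
   remaining chunk is in place at the bottom. */
extern void comp_and_swap( int i, int j );   /* pairwise sort of chunks i and j */

void bidi_bubble_chunks( int nchunks ) {
    int lo = 0, hi = nchunks - 1;
    while( lo < hi ) {
        for( int i = lo; i < hi; ++i )       /* left-to-right */
            comp_and_swap( i, i + 1 );
        --hi;
        for( int i = hi; i > lo; --i )       /* right-to-left */
            comp_and_swap( i - 1, i );
        ++lo;
    }
}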

So now I'm looking at insertion sorting the pairs together.
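For readers unfamiliar with the idea, the textbook form of that -- merging two adjacent, already-sorted runs in place by rippling each record of the right-hand run down into position -- looks like this (a naive sketch over 8-byte records as in the C timings further down; not the actual inPlaceMerge code):

/* Naive in-place merge of two adjacent sorted runs by insertion:
   buf[0..left-1] and buf[left..left+right-1] are each sorted; each
   record of the right run is rippled down until it is in place.
   Cheap when few records are out of place; O(left*right) moves when
   many are. */
#include <stddef.h>
#include <stdint.h>

void inplace_insert_merge( uint64_t *buf, size_t left, size_t right ) {
    for( size_t i = left; i < left + right; ++i ) {
        uint64_t rec = buf[ i ];
        size_t   j   = i;
        while( j > 0 && buf[ j - 1 ] > rec ) {   /* ripple down */
            buf[ j ] = buf[ j - 1 ];
            --j;
        }
        buf[ j ] = rec;
    }
}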

Now to answer your question

Yes. But how much is required to significantly improve things?

That is: I typically have 5 to 6 GB of my 8GB physical memory available when I'm not running something important. So, I could split that into (say) 2 x 2GB chunks and have 1 or 2GB as scratch space. But is 1 or 2GB enough to make a significant difference when merging 2 x 2GB?

As an additional complication to assessment, that scratch space would be outside of the memory-mapped file, so, like spill files, whatever is copied into it will eventually have to be copied back. Less costly than a spill file, but still a cost to be considered. In addition, whatever scratch space I use comes out of the buffer space, forcing smaller buffers; thus more (smaller) partial sorts and more buffers to merge. It's a difficult assessment.

In the standard merge phase of an external sort, you have an unlimited buffer -- the output file -- as the repository of the merged records.

I originally thought that as the merge phase picks the next record from the sorted spill-files one at a time, I'd be able to put them back into the space vacated by the records extracted from the buffers for merge cross-comparison. It doesn't work out like that. You have to ripple the replaced record down into its proper place in order to maintain the sub-sorted order required for the merge to work.

I looked at trying to do that on an n-at-a-time basis, but it gets very messy with mapping and remapping sub-chunk-sized bits (pages) of the N buffers. So then I came up with this pair-wise, two-at-a-time process, which makes dealing with the mapping problem much simpler. And it works.

In C, qsorting 100 million 8-byte records takes ~40 seconds. qsorting that as 2 x 50-million-record buffers takes ~37 seconds, and insertion-sorting them together takes ~3 seconds:

C:\test\C>inPlaceMerge 100000000 2
qsort took 39.071390618 secs for 100000000 element array
qsort took 37.654166926 secs for 2 partitions of 100000000 element array
isort took 2.540992389 secs for 2 partitions of 100000000 element array

It looks less rosy with more partitions:

C:\test\C>inPlaceMerge 100000000 4
qsort took 39.024688029 secs for 100000000 element array
qsort took 36.237764463 secs for 4 partitions of 100000000 element array
isort took 6.176087620 secs for 4 partitions of 100000000 element array

C:\test\C>inPlaceMerge 100000000 8
qsort took 38.918586479 secs for 100000000 element array
qsort took 34.792559013 secs for 8 partitions of 100000000 element array
isort took 12.882464701 secs for 8 partitions of 100000000 element array

But nothing show-stopping.

The time cost of the insertion sort grows with the number of buffers -- roughly doubling each time the buffer count doubles, as the figures above show -- because the distance out-of-place records have to be rippled gets further each time.

I think there is scope for improvement there. E.g. I'm currently merging the records 'forwards' -- 1&2, 1&3, 1&4, 1&5... -- does it make any difference if I reverse that and do 4&5, 3&4, 2&3, 1&2?

Obviously when merging 1 & 2, there will still be records from 1 that need to move all the way to buffer 5. But shouldn't I be able to skip ahead by comparing the record against the top record in each of the now-ordered buffers, and insert it there? Trouble is, I'd still need to ripple all the records displaced above it back through all the other buffers.

Maybe that's where the scratch space comes in. As I ripple records off the top of a buffer (to accommodate a new insertion), rather than rippling the rest of the buffers back to that insertion record's source, I spill them into a (say) page-sized 'defer' scratch buffer, and only ripple them back once that buffer is full. With luck, as its contents come off the top of an already-sorted buffer, the whole defer buffer can be inserted into the bottom of the preceding buffer as a block.

But does that help? You still need to move the preceding buffer(s) 'up' by a buffer-size to accommodate it; and then their spills into the preceding buffer ...

At that point, I can't keep the costs/benefits analysis clear in my head...



Re^3: Can you improve upon my algorithm.
by salva (Canon) on Mar 12, 2015 at 22:04 UTC
    The thing is that once you have the chunks sorted, theory says that in practice you cannot do much better than an N-way merge.

    An N-way merge uses almost no memory, just enough to avoid the thrashing caused by reading from N files in parallel (say 10MB per way; for a 100GB file and 50 chunks, that means 10_000 seeks, approximately 200s on a modern disk, ignoring fragmentation and FS overhead).

    The computational complexity of the merge is O(M*logN), where M is the dataset size (100GB). Actually, if you count the number of comparisons that have to be performed, it is c = M*log2(N). In your case, with N = 50, c ≈ 8*M = 800GB. Current RAM bandwidth is around 10GB/s, but that is a peak figure; let's assume that in practice the memory bandwidth is 1GB/s. That means that those comparisons (the bottleneck there is the RAM) would take 800GB / 1GB/s = 800s. 15 minutes!

    So, where does the time go? In reading the data from disk into memory and writing it back. With a good HDD (100MB/s), reading + writing 100GB is 2 * 100GB / 100MB/s = 2000s.

    In summary, merging the 50 blocks should take around 3000s. 1 hour! In theory!!!

      In theory!!!!

      In practice!

      My E: drive is brand new, barely used and connected to a SATA-3 i/f.

      wc -c does almost nothing but increment a variable, and it reads a 34GB file in 192 secs, at a rate of 171MB/s:

      E:\>dir 30GB.dat
      28/02/2015  08:21    34,560,000,000 30GB.dat

      [ 0:08:11.82] E:\>wc -c 30GB.dat
      34560000000 30GB.dat

      [ 0:11:23.17] E:\>perl -E"say 34560000000 / 192 / 1024**2;"
      171.661376953125

      I created 50 1GB files:

      I then ran the following script which:

      1. reads 10MB chunks into separate buffers, from each of the 50 files;
      2. selects one of the buffers and trims a 16-byte record from it;
      3. loop back to 2; until one of the buffers is empty;
      4. repopulate that buffer from disk;
      5. loop back to 2; until a file is empty;
      6. close the file and remove the buffer from the array of buffers;
      7. loop back to 2; until all the buffers are empty;

      Total runtime (projection): 85 hours; average read rate: 10MB/minute or 171KB/second.

      The script:

      Note: there is no heap or merge or insertion sort of the records being 'merged', indeed no comparisons whatsoever; and no writing to disk.

      Just the generation of a random number, the copying of 16 bytes, and the adjustment of a pointer (CUR) for each record.

      But it happens 3.3 billion times.

      With the result that there is almost no IO activity for 98% of the time, and then occasional bursts as 10MB buffers are repopulated.

      As the nature of random is to distribute evenly, the 50 x 10MB reads tend to come grouped pretty close together; roughly every 50 minutes.

      With 102 buffers per 1GB file, that's 102 * 50 = 5100 minutes, or 85 hours, or 3 1/2 days; which mirrors the timings I've experienced using external sort programs.

      I'm not going to let it run to completion. I hope you'll understand why.

      Update: At the point of posting this, the process had been running for exactly 2 hours and had read 890.4MB.

      That's 7.42MB/minute which projects to 4.8 days to complete the 50GB.


        Oh, well, I meant if you do it in C; Perl has too much overhead for that.

        For instance, removing everything not in the hot path for your script:

        while( @fhs ) {
            printf "\r%d\t", scalar( @fhs );
            my $r = int rand @fhs;
            57 unless( length( $bufs[ $r ] ) );
            my $nextRec = substr( $bufs[ $r ], -16, 16, '' );
        }

        Perl compiles it into 40 OPs, 35 inside the loop:

        $ perl -MO=Concise,-exec buk.pl
        1  <0> enter
        2  <;> nextstate(main 5 buk.pl:1) v:{
        3  <{> enterloop(next->z last->13 redo->4) v
        10 <#> gv[*fhs] s
        11 <1> rv2av[t2] sK/1
        12 <|> and(other->4) vK/1
        4  <;> nextstate(main 1 buk.pl:2) v
        5  <0> pushmark sM
        6  <$> const[PV "\r%d\t"] sM
        7  <#> gv[*fhs] s
        8  <1> rv2av[t5] sK/1
        9  <@> prtf vK
        a  <;> nextstate(main 1 buk.pl:3) v
        b  <#> gv[*fhs] s
        c  <1> rv2av[t8] sK/1
        d  <1> rand[t9] sK/1
        e  <1> int[t10] sK/1
        f  <0> padsv[$r:1,3] sRM*/LVINTRO
        g  <2> sassign vKS/2
        h  <;> nextstate(main 2 buk.pl:4) v:{
        i  <#> gv[*bufs] s
        j  <1> rv2av sKR/1
        k  <0> padsv[$r:1,3] s
        l  <2> aelem sK/2
        m  <1> length[t12] sKP/1
        n  <|> or(other->o) vK/1
        o  <;> nextstate(main 2 buk.pl:5) v:{
        p  <#> gv[*bufs] s
        q  <1> rv2av sKR/1
        r  <0> padsv[$r:1,3] s
        s  <2> aelem sKM/2
        t  <$> const[IV -16] s/FOLD
        u  <$> const[IV 16] s
        v  <$> const[PV ""] s
        w  <@> substr[t16] sK/4
        x  <0> padsv[$nextRec:2,3] sRM*/LVINTRO
        y  <2> sassign vKS/2
        z  <0> unstack v
           goto 10
        13 <2> leaveloop vKP/2
        14 <@> leave[1 ref] vKP/REFC
        My estimation is that every OP performs between 50 and 100 machine instructions, with a good proportion of performance-unfriendly conditional branches. So, roughly, every run of the loop above would require at least 2000 instructions, and may be reaching the capacity of the L1 code cache.

        On the other hand, an N-way merge in C, on the hot path of the inner loop, needs just a couple of operations plus log2(N) key comparisons. That may sum up to 50 instructions for every loop cycle (= sorted record)!
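        For illustration, here is a minimal sketch of that hot path -- a binary min-heap over the N ways, keyed on each way's current 16-byte record. The names, the Way struct and the omission of buffer refilling are purely illustrative, not taken from any real implementation:

        /* Sketch of an N-way merge hot path: a binary min-heap of N "ways",
           each exposing the current record of its (buffered) sorted run.
           Per output record: one 16-byte copy, one pointer bump, and a
           sift-down costing ~log2(N) key comparisons.  Buffer refilling
           and I/O are deliberately left out. */
        #include <string.h>

        #define RECSIZE 16

        typedef struct {
            const char *cur;    /* current record of this way */
            const char *end;    /* end of this way's in-memory buffer */
        } Way;

        static int rec_cmp( const char *a, const char *b ) {
            return memcmp( a, b, RECSIZE );
        }

        static void sift_down( Way *heap, int n, int i ) {
            for( ;; ) {
                int l = 2 * i + 1, r = l + 1, m = i;
                if( l < n && rec_cmp( heap[ l ].cur, heap[ m ].cur ) < 0 ) m = l;
                if( r < n && rec_cmp( heap[ r ].cur, heap[ m ].cur ) < 0 ) m = r;
                if( m == i ) break;
                Way t = heap[ i ]; heap[ i ] = heap[ m ]; heap[ m ] = t;
                i = m;
            }
        }

        /* heap[0..n-1] already heapified; out receives the merged records */
        void merge_ways( Way *heap, int n, char *out ) {
            while( n > 0 ) {
                memcpy( out, heap[ 0 ].cur, RECSIZE );  /* emit smallest record */
                out += RECSIZE;
                heap[ 0 ].cur += RECSIZE;               /* advance that way */
                if( heap[ 0 ].cur >= heap[ 0 ].end )    /* run exhausted... */
                    heap[ 0 ] = heap[ --n ];            /* ...drop the way */
                sift_down( heap, n, 0 );                /* restore heap order */
            }
        }

        That is the couple of operations plus log2(N) comparisons per sorted record counted above.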

        Update: Oops, I forgot to say, in theory!

        1 hour to sort 100GB actually seems too fast to me. But I think the analysis is sound, and unless I am overlooking some hidden bottleneck, the order of magnitude should be right!