Can you improve upon my algorithm.

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

...and can you assess its big-O efficiency?

The algorithm merges, in-place, $G sorted sub-partitions of an array into a fully sorted array.

Graphically this shows the algorithm merging 3, 4-element sub-partitions (warning: best viewed with a wide screen):

C:\test>inplaceMerge -N=12 -G=3
Starting point:
 pA=0                p[i]=4 
[231, 584, 675, 787] [165, 177, 714, 838] [451, 456, 714, 887]     tem
+p = d[pA]:231, d[pA] = d[p[i]]:165, d[pB-1] = d[pB]:177, d[--pB] = te
+mp:231
      pA=1           p[i]=4
[165, 584, 675, 787] [177, 231, 714, 838] [451, 456, 714, 887]     tem
+p = d[pA]:584, d[pA] = d[p[i]]:177, d[pB-1] = d[pB]:231, d[--pB] = te
+mp:584
           pA=2      p[i]=4
[165, 177, 675, 787] [231, 584, 714, 838] [451, 456, 714, 887]     tem
+p = d[pA]:675, d[pA] = d[p[i]]:231, d[pB-1] = d[pB]:584, d[--pB] = te
+mp:675
                pA=3 p[i]=4
[165, 177, 231, 787] [584, 675, 714, 838] [451, 456, 714, 887]     tem
+p = d[pA]:787, d[pA] = d[p[i]]:584, d[pB-1] = d[pB]:675,  d[pB-1] = d
+[pB]:714, d[--pB] = temp:787

 pA=0                                     p[i]=8
[165, 177, 231, 584] [675, 714, 787, 838] [451, 456, 714, 887] 
      pA=1                                p[i]=8
[165, 177, 231, 584] [675, 714, 787, 838] [451, 456, 714, 887]
           pA=2                           p[i]=8
[165, 177, 231, 584] [675, 714, 787, 838] [451, 456, 714, 887]
                pA=3                      p[i]=8
                
[165, 177, 231, 584] [675, 714, 787, 838] [451, 456, 714, 887]     tem
+p = d[pA]:548, d[pA] = d[p[i]]:451, d[pB-1] = d[pB]:456, d[--pB] = te
+mp:458
                      pA=4                p[i]=8
[165, 177, 231, 451] [675, 714, 787, 838] [456, 584, 714, 887]     tem
+p = d[pA]:675, d[pA] = d[p[i]]:456, d[pB-1] = d[pB]:584, d[--pB] = te
+mp:675
                           pA=5           p[i]=8
[165, 177, 231, 451] [456, 714, 787, 838] [584, 675, 714, 887]     tem
+p = d[pA]:714, d[pA] = d[p[i]]:584, d[pB-1] = d[pB]:675, d[--pB] = te
+mp:714
                                pA=6      p[i]=8
[165, 177, 231, 451] [456, 584, 787, 838] [675, 714, 714, 887]     tem
+p = d[pA]:787, d[pA] = d[p[i]]:675, d[pB-1] = d[pB]:714, d[pB-1] = d[
+pB]:714, d[--pB] = temp:787
                                     pA=7 p[i]=8
[165, 177, 231, 451] [456, 584, 675, 838] [714, 714, 787, 887]     tem
+p = d[pA]:838, d[pA] = d[p[i]]:714, d[pB-1] = d[pB]:714, d[pB-1] = d[
+pB]:787, d[--pB] = temp:838
[download]

There is a brief description of the algorithm in words, in the comments below:

#! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000;

our $N //= 400;
our $G //= 4; my $GSize = int( $N / $G );
our $S //= 1; srand( $S );

my @p = map $_ * $GSize, 0 .. $G-1; #pp \@p;
my @d = map int( rand 1000 ), 1 .. $N;

sub show{ print pp [ @d[ $_ .. $_ + $GSize-1 ] ] for @p; print''; }

splice @d, $_, $GSize, sort{ $a <=> $b } @d[ $_ .. $_ + $GSize-1 ] for
+ @p; show;

## The algorithm starts here; 
## with @d containing $G sorted subpartitions.
## and  @p the start indexes of those sub-partitions
## and  $GSize containing the length of the sub-partitions.
## Note:If the command line args partition the array into $G+1 partiti
+ons; 
## eg. -N=100 -G=7 given partitions [0..13][14..27][28..41][42..55][56
+..69][70..83][84..98][99]
## the last 'extra' partition is ignored. (For now.)

## merge each partition after the first
for my $i ( 1 .. $#p ) {

    ## against all the following partitions starting with the first
    for( my $pA = 0; $pA < $p[ $i ]; ++$pA ) {

        ## $pA indexes to the next element to be merged
        ## if it is less the top element of the partition being merged
        if( $d[ $pA ] > $d[ $p[ $i ] ] ) {
            my $temp = $d[ $pA ];           ## Keep a copy of it
            $d[ $pA ] = $d[ $p[ $i ] ];     ## swap the lower element 
+into place
            my $pB;                         ## external declaration al
+lows $pB to remember where it finished when the next loop terminates

            ## Insertion sort the element being swapped out ($temp) in
+to the right place in the partition being merged with
            for( $pB = $p[ $i ]+1; $pB < ( $p[ $i ]+$GSize ) && $d[ $p
+B ] < $temp; ++$pB ) { $d[ $pB-1 ] = $d[ $pB ]; };

            ## Put the swapped out element into the gap we opened up
            $d[ --$pB ] = $temp;
        }
    #    show; <STDIN>;
    }
    #show;
}
show;
[download]

Notes:

This is destined to be implemented in C, on larger than memory datasets -- hence sorted sub-partitions, so time comparisons of this algorithm against just sorting the whole array aren't applicable.
For similar reasons, suggestions like using splice instead of the for loop insertion sort also aren't applicable.

Searches for "in-place merges of sorted sub-partitions" turn up lots of people asking, but few good answers and lots of "just sort them together" -- completely impractical for my larger-than memory datasets.

Ive tried to relate what I'm doing to the formal literature (eg. http://comjnl.oxfordjournals.org/content/38/8/681.full.pdf, which seems to be very pertinent), but despite the time I've spent reading this sort of paper, I still find passages like:

Block Exchanges
A block exchange of two consecutive blocks U and V of sizes l₁ and l₂, respectively, that occupy segment L[c..c + l₁ + l₂ - 1 ] result in the movement of block U which occupies segment L[c..c + l₁ - 1 ] to L[c + l₂..c + l₁ + l₂ - 1 ], and in the movement of V initially in segment L[c + l₁..c + l₁ + l₂ - 1 ] to segment L[c..c + l₂ - 1 ]

as a description of: [...uuuuVVVVVVVVV...] becomes [...VVVVVVVVVuuuu...], entirely opaque for the first half dozen readings. Even the included figure doesn't help much. (Why oh why are they written this way?) Even the choice of U & V seems almost deliberate to make things hard to read.

What am I looking/hoping for?

Improvements to this algorithm (as opposed to the Perl implementation). Feels like there ought to be a way to avoid starting pA at the beginning for each extra partition.

Better algorithms. I've looked, but maybe someone knows of one.

Better approaches to the overall problem.

Any assessment of the complexity. I've tried 3 times and gotton 3 different answers.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Comment on Can you improve upon my algorithm. Select or Download Code

Replies are listed 'Best First'.
Re: Can you improve upon my algorithm. by bitingduck (Deacon) on Mar 12, 2015 at 05:29 UTC
Looking through your code and the paper you linked, it looks like what you're doing is more or less what they describe in the introduction-- element by element merge that will require O(N) moves and O(m log (n/m)) comparisons for two lists, with appropriate multipliers as in my other post for performance on multiple lists. The next fancier way to do it (section 2.2) is that rather than swapping one element at a time, keep going up the subset until you get to an element that you can't swap in (because it's bigger than the next number in the sequence you're going into) and then do a block move of everything up to that point, swapping into the buffer as needed. This idea of moving a group at a time is where the faster algorithms come in. Section 2.3 is the same thing, but blockwise-- break each list into blocks (say of sqrt(Gsize)) and compare the starts and ends of those, and move whole blocks when possible, partial blocks when necessary, and storing in the buffer. The buffer is apparently where stability gets lost, if its ordering gets munged too badly. But this block swapping (is this whole block able to fit before that one?) is the basis for the faster algorithm. The stuff between 2.3 until 4.0 looks like mostly history and distraction from the algorithm in 4.0-4.3, where block exchange looks like my triple reverse from last week:). In the worst case (sublists like `[1,4,7,...][2,5,8,...][3,6,9,...]`) it should end up performing like the one element at a time algorithm, but in cases where there are runs that fit inside each other, you'll approach the limiting performance. Edit about an hour later...: Here's what I hope is an almost self explanatory example using only Block_Exchange (if I did it right). You start with the two leftmost groups, figure out where you need to put the left, right, and center of a block_exchange, then you do it, adjust your three pointers, then repeat. When you get to the next presorted group you start over again. For small samples it doesn't look like much, but if you have big sparse lists it's probably somewhat better than bubbling individual elements into place ]= rightmost point of a sorted subgroup [= leftmost point of a sorted subgroup \| indicate the ends of the block_exchange points ^ indicates the center about which the exchange is made [ 1 7 15 32] [3 17 54 63][27 35 59 60][11 23 24 29] [ 1 \|7 15 32]^[3 17\| 54 63][27 35 59 60][11 23 24 29] [ 1 \|7 15 32]^[3 \|17 54 63][27 35 59 60][11 23 24 29] [ 1 3 ][7 15\| 32]^[17 \|54 63][27 35 59 60][11 23 24 29] [ 1 3 7 15 17\| 32 54 63]^[27\| 35 59 60][11 23 24 29] [ 1 3 7 15 17 27 32 \|54 63]^[35\| 59 60][11 23 24 29] [ 1 3 7 15 17 27 32 35 54 \|63]^[59 60]\|[11 23 24 29] [ 1 3 7 \|15 17 27 32 35 54 59 60 63]^[11\| 23 24 29] [ 1 3 7 11 15 17 \|27 32 35 54 59 60 63]^[23 24\| 29] [ 1 3 7 11 15 17 23 24 27\| 32 35 54 59 60 63]^[29]\| [ 1 3 7 11 15 17 23 24 27 29 32 35 54 59 60 63] [download]	[reply] [d/l] [select]
Re^2: Can you improve upon my algorithm. by BrowserUk (Patriarch) on Mar 12, 2015 at 14:48 UTC
It'll take me a while to wrap my head around this, and read through the (much clearer) paper you linked in your other reply, but I can't resist commenting on: where block exchange looks like my triple reverse from last week:) I laughed myself to sleep after reading: "There is a more elegant algorithm which is considered to be a part of folklore. blah blah blah. The proof of correctness is left to the reader." No wonder, given their documentation methods. My proof: `abc12345 54321cba 12345cba 12345abc` [download] Sor''d! :) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l]
Re: Can you improve upon my algorithm. by bitingduck (Deacon) on Mar 12, 2015 at 04:16 UTC
I'm a little too brain dead at the moment to follow the details of your algorithm, but for a crude upper limit of the complexity you can take it as a Merge Sort where you're starting part of the way through. You've already assembled from pairs back up to G lists of size GSize. So if you were doing a full up Merge Sort on N elements you'd be O(NlogN), but you've already done a bunch of it, so you're reducing it by G(Gsize log Gsize), giving you something like O((NlogN)-G(Gsize log Gsize)) as an upper limit.(Ignoring for the moment the in-place issue) These guys (Huang and Langston, who are ref 17 in your more dense paper) are a little bit easier to follow and claim that their in-place algorithm is O(N). They say less than 2N, so it's probably linear, or only slightly worse, and your ref gives you something like the sum (where I'm abbreviating Gsize as s for readability): s log s + (s log (2s /s + 1)) + (s log (3s /s + 1))... up to the integer in the log getting to N/G, which looks like it reduces to: =O(s log s + (s log (3)) + (s log (4))...(s log (N/G))) where the sum is assuming you merge one pair of your subsets together, then merge in the next one (so you've got length s and 2s, then length s and 3s, up until you've done them all. =O(s (Sum (log 1..N/G))) =O(s log((N/G)!)) which doesn't look so bad-- it's better than NlogN and if you substitute back in s=N for the length of a single G=1, you end up back with NlogN, so it might be about right. (I think... I probably did something wrong along the way and will want to strike this out later...but after this bit of exercise I might be able to wade through your algorithm, too...) Edit: Huang and Langston isn't quite in place-- it uses a constant extra buffer, as does your Symvonis reference. another edit to wrap the whole expression after the O in display equation 1 in parentheses.	[reply]
Re: Can you improve upon my algorithm. by atcroft (Abbot) on Mar 12, 2015 at 04:43 UTC
As to the complexity, (to me) it looks very much like O(n^2) in time. Although each sub-partition is sorted, the worst case would seem to be the case of sets where the largest values appear in the initial set, the second largest values in the second set, and so forth, until the smallest values appear in the last set. (Is it a radix or bucket sort?) Hope that helps.	[reply]
Re^2: Can you improve upon my algorithm. by BrowserUk (Patriarch) on Mar 12, 2015 at 14:08 UTC
As to the complexity, (to me) it looks very much like O(n^2) in time. I agree. the Perl version certainly looks to be tending that way with respect to the time for a number of records with the number of buffers fixed. But the number of buffers also has a non-linear effect that is harder to quantify. These are timings taking using the Perl prototype: `Array N u m b e r o f b u f f e r s. Size 2 4 8 16 + 10000 9.710932970 14.441396952 16.980458975 17.980459929 20000 38.186435938 54.618164063 65.470575094 68.913941145 40000 148.056374073 224.696855068 244.073187113 260.217727184 80000 310.495392084 882.082360983 969.694694996 1098.488962889` [download] But Perl's internal overheads tend to cloud the issue. These are a similar set of timings, (though with 1000 times as many records), from a straight Perl to C conversion: `Array N u m b e r o f b u f f e r s. Size 2 4 8 1 +6 10000000 0.248736811 0.497881829 0.799180768 1.33560723 +0 20000000 0.285630879 1.024963784 1.879727758 3.58831668 +6 40000000 0.402336371 1.211978628 3.943654854 7.76231127 +6 80000000 0.966428312 2.093228984 10.674847289 15.20866750 +8 160000000 3.682530644 7.499381800 16.773807549 34.45496560 +7` [download] That's a lot less pessimistic of the algorithm. Based on the timings alone, it looks to be linear(ish) in the number of buffers and (um) log.quadratic in the number of records? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]
Re^3: Can you improve upon my algorithm. by bitingduck (Deacon) on Mar 12, 2015 at 15:01 UTC
Based on the timings alone, it looks to be linear(ish) in the number of buffers and (um) log.quadratic in the number of records? That sounds consistent with what the wikipedia article says about in place merge sorting: O(Nlog^2(N)), but with a "reference needed" note, so it might be empirical or folklore.	[reply]
Re: Can you improve upon my algorithm. by salva (Canon) on Mar 12, 2015 at 10:11 UTC
Your requirements are quite unusual. Usually, you want and algorithm to work in-place when it holds the full dataset in memory. More rarely, when it holds the dataset in disk. But both? Can't you at least use some extra RAM as a buffer?	[reply]
Re^2: Can you improve upon my algorithm. by BrowserUk (Patriarch) on Mar 12, 2015 at 13:09 UTC
Can't you at least use some extra RAM as a buffer? I frequently need to sort huge (upto and sometimes beyond 100GB) files of fixed length records. External sorting programs -- gnu sort or MS sort -- are very bad at this; it takes days. So finally I got around to writing my own. Part of what makes general purpose algorithms bad at this, is the time they spend writing to and re-reading from spill files. Another part is that they need to deal with variable length records. My approach is to sort the files in-place on disk by using memory mapped IO. Map a (available physical memory-sized) chunk, and sort it, map a chunk and sort it. Then merge the sorted chunks. By making the chunks 1/2 the size of available physical memory I can ensure I can have full access to two complete chunks at any time. One approach to merging the chunks is to sort them together in pairs in a kind of bi-directional bubble sort. Note: each of the letters here represents a (say) 2GB chunk of sorted memory. By sorting them pair-wise and moving left-to-right, then right-to-left, (one less each time) eventually everything ends up in the right place. This works; it's not very efficient.: <Reveal this spoiler or all in this thread> So now I'm looking at insertion sorting the pairs together. Now to answer your question Yes. But how much is required to significantly improve things? That is: I typically have 5 to 6 GB of my 8GB physical memory available when I'm not running something important. So, I could split that into (say) 2 x 2GB chunks and have 1 or 2GB as scratch space. But is 1 or 2GB enough to make a significant difference when merging 2 x 2GB? As an additional complication to assessment, that scratch space would be outside of the memory mapped file, so like spill-files, whatever is copied into it will eventually have to be copied back. Less costly than a spill file but still a cost to be considered. In addition, whatever scratch space I use comes out of the buffer space, thus requiring more buffers; thus more (smaller) partial sorts and more buffers to merge. It's a difficult assessment. In the standard merge phase of an external sort, you have an unlimited buffer -- the output file -- as the repository of the merged records. I originally thought that as the merge phase picks the next record from the sorted spill-files one at a time, I'd be able to put them back into the space vacated by the records extracted from the buffers for merge cross-comparison. It doesn't work out like that. You have to ripple the replaced record down into its proper place in order to maintain the sub-sorted order required for the merge to work. I looked at trying to do that on a n-at-a-time basis, but it gets very messy with mapping and remapping sub-chunk sized bits (pages) of the N buffers. So then I came up with this pair-wise, two at a time process which makes dealing with the mapping problem much simpler. And it works. In C, qsorting a 100 million 8-byte records takes ~40 seconds. qsorting that as 2x50 million record buffers takes ~37 seconds, and insertion sorting them together takes ~3 seconds: `C:\test\C>inPlaceMerge 100000000 2 qsort took 39.071390618 secs for 100000000 element array qsort took 37.654166926 secs for 2 partitions of 100000000 element arr +ay isort took 2.540992389 secs for 2 partitions of 100000000 element arra +y` [download] It looks less rosy with more partitions: `C:\test\C>inPlaceMerge 100000000 4 qsort took 39.024688029 secs for 100000000 element array qsort took 36.237764463 secs for 4 partitions of 100000000 element arr +ay isort took 6.176087620 secs for 4 partitions of 100000000 element arra +y C:\test\C>inPlaceMerge 100000000 8 qsort took 38.918586479 secs for 100000000 element array qsort took 34.792559013 secs for 8 partitions of 100000000 element arr +ay isort took 12.882464701 secs for 8 partitions of 100000000 element arr +ay` [download] But nothing show stopping. The time cost of the insertion sort for multiple buffers is not linear. It (roughly) doubles with each extra buffer. This is because the distance you have to ripple out of place records gets further each time. I think there is scope for improvement there. Eg. I'm currently merging the records 'forwards' -- 1&2, 1&3, 1&4, 1&5... -- does it make any difference if I reverse that and so 4&5, 3&4, 2&3, 1&2? Obviously when merging 1 & 2, there will still be records from 1 that need to move all the way to buffer 5. But shouldn't I be able to skip ahead by comparing the record against the top record in each of the now order buffers, and insert it there? Trouble is, I'd still need to ripple all those displaced above it, back through all the other buffers. Maybe that's where the scratch space comes in. As I ripple records of the top of a buffer (to accommodate a new insertion), rather than rippling the rest of the buffers back to that insertion records source, I spill it into a (say) page-sized 'defer' scratch buffer. And I only ripple it back once that buffer is full. With luck, as the contents come of the top of an already sorted buffer, the whole buffer can be insert into the bottom of the preceding buffer as a block. But does that help? You still need to move the preceding buffer(s) 'up' by a buffer-size to accommodate it; and then their spills into the preceding buffer ... At that point, I can't keep the costs/benefits analysis clear in my head... With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]
Re^3: Can you improve upon my algorithm. by salva (Canon) on Mar 12, 2015 at 22:04 UTC
The thing is that once you have the chunks sorted, what the theory says is that in practice, you can not do much better than a N-way merge. A N-way merge uses almost no memory, just enough to avoid the trashing caused by reading from N files in parallel (say 10MB per-way, for a 100GB file and 50 chunks, that means 10_000 seeks, approximately 200s in a modern disk ignoring fragmentation and FS overhead). The computational complexity of the merge is O(MlogN) where M is the dataset size (100GB). Actually if you count the number of comparisons that have to be performed it is c = Mlog₂N. In your case, as N = 50, c = 8M = 800GB. Current RAM bandwidth is around 10GB, but that is for peaks, lets assume that in practice the memory bandwidth is 1GB/s. That means that those comparisons (the bottleneck there is the RAM) would take 800GB / 1GB/s = 800s. 15 minutes! So, where does the time goes? in reading the data from disk into memory and writing it back. With a good HDD (100MB/s) reading+writing 100GB are 2 * 100GB / 100MB/s = 2000s. In summary, merging the 50 blocks, should take around 3000s. 1 hour! In theory!!!	[reply]
Re^4: Can you improve upon my algorithm. by BrowserUk (Patriarch) on Mar 13, 2015 at 04:40 UTC
Re^5: Can you improve upon my algorithm. by salva (Canon) on Mar 13, 2015 at 10:06 UTC
Some notes below your chosen depth have not been shown here
Re^2: Can you improve upon my algorithm. by bitingduck (Deacon) on Mar 12, 2015 at 15:22 UTC
Your requirements are quite unusual. In place handling of huge datasets have been a theme of BrowserUk's questions for a while now. They're interesting because they're things you generally wouldn't think twice about for small data sets, but in a resource constrained environment (time, memory) how you decide to do them can have a big effect on performance, or even solvability. It's kind of like going back to 1980 with a 1 Mhz machine with 16 kB of RAM and trying to do anything at all. Back in those days I once made a list of all the 4 letter combinations and stored them to disk, then printed them and discovered my printer was as fast as my floppy drive.	[reply]

Block Exchanges

What am I looking/hoping for?

Now to answer your question