Re^5: [OT] Swapping buffers in place.

Replies are listed 'Best First'.
Re^6: [OT] Swapping buffers in place. (Final summation.) by BrowserUk (Patriarch) on Mar 02, 2015 at 14:15 UTC
Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries. The pathological behaviour only exhibits with buffer/offset pairings specifically chosen to cause it. The paramaters in the following run, 536,870,912/268,435,456 represent an exactly 2GB buffer with the partition half way along: C:\test\C>bufswap 536870912 268435456 2 ### 2^29 2^28 size:536870912 offset;268435456 [ 0 1 ... 268435454 268435455 ^ 268435456 268 +435457 ... 536870910 536870911 ] [ 268435456 268435457 ... 536870910 536870911 ^ 0 + 1 ... 268435454 268435455 ] iterative: swaps:536870912 took 4.364767872 secs. [ 0 1 ... 268435454 268435455 ^ 268435456 268 +435457 ... 536870910 536870911 ] [ 268435456 268435457 ... 536870910 536870911 ^ 0 + 1 ... 268435454 268435455 ] recursive: swaps:268435456 took 2.476348362 secs. [download] Note that although the iterative algorithm does double the swaps, (equal to the total number of U64 elements in the buffer ) of the recursive version, the time taken is substantially less than double (175% .v. 200%) than the time taken by the recursive version. So, there is nothing wrong with the efficiency of the implementation of the algorithm. And in many (maybe most) buffer/offset ratios, they both do the same number of swaps: `C:\test\C>bufswap 30000000 20000000 2 size:30000000 offset;20000000 [ 0 1 ... 19999998 19999999 ^ 20000000 20 +000001 ... 29999998 29999999 ] [ 20000000 20000001 ... 29999998 29999999 ^ 0 + 1 ... 19999998 19999999 ] iterative: swaps:30000000 took 0.235516729 secs. [ 0 1 ... 19999998 19999999 ^ 20000000 20 +000001 ... 29999998 29999999 ] [ 20000000 20000001 ... 29999998 29999999 ^ 0 + 1 ... 19999998 19999999 ] recursive: swaps:20000000 took 0.163142653 secs.` [download] By anyone's standards, performing 1/4 billion swaps of two, 64-bit values, shifting 4GB of data around in the process, all in 4.36 seconds, ain't at all tardy! Oh to be able to get the data off my discs fast enough to make use of that ~1GB/s throughput. The best I get out of my HD is about 200MB/s. My new SSD gets close, but only if the cache is warm. Also note that the total memory above is only 57 x U64s (456 bytes) different to the pathological case, where the buffer/offset pairing is 536,870,855/268,435,399 (2GB+largest prime smaller/largest prime smaller): C:\test\C>bufswap 536870855 268435399 2 ### 2^28+prime prime size:536870855 offset;268435399 [ 0 1 ... 268435397 268435398 ^ 268435399 268 +435400 ... 536870853 536870854 ] [ 268435399 268435400 ... 536870853 536870854 ^ 0 + 1 ... 268435397 268435398 ] iterative: swaps:536870855 took 23.912365889 secs. [ 0 1 ... 268435397 268435398 ^ 268435399 268 +435400 ... 536870853 536870854 ] [ 268435399 268435400 ... 536870853 536870854 ^ 0 + 1 ... 268435397 268435398 ] recursive: swaps:536870854 took 3.633964430 secs. [download] Note that the number of swaps are only 1 different, but the time is 658% longer. That threw me through a lot of hoops! Even though I chose the numbers, (power of 2/largestest prime smaller; based on previous experience of a modulo arithmetic driven process), to exacerbate the scattering effect of the modulo arithmetic -- maximising the number of times the pointer has to wrap around -- it took me completely by surprise at the size of the difference it made. I went over and over and over the code looking for some cock-up before the reality of cache misses hit me. But I also 'lucked out'. I've since tried various other power-of-two/largest prime smaller combinations, and none of them come close to triggering the same kind of differences: C:\test\C>bufswap 2097147 1048571 2 ### 2^20+prime prime size:2097147 offset;1048571 [ 0 1 ... 1048569 1048570 ^ 1048571 1 +048572 ... 2097145 2097146 ] [ 1048571 1048572 ... 2097145 2097146 ^ 0 + 1 ... 1048569 1048570 ] iterative: swaps:2097147 took 0.050917112 secs. [ 0 1 ... 1048569 1048570 ^ 1048571 1 +048572 ... 2097145 2097146 ] [ 1048571 1048572 ... 2097145 2097146 ^ 0 + 1 ... 1048569 1048570 ] recursive: swaps:2097146 took 0.022471381 secs. [ 0 1 ... 1048569 1048570 ^ 1048571 1 +048572 ... 2097145 2097146 ] [ 1048571 1048572 ... 2097145 2097146 ^ 0 + 1 ... 1048569 1048570 ] reversive: swaps:2097146 took 0.030067538 secs. C:\test\C>bufswap 33554429 16777213 2 ### 2^24+prime prime size:33554429 offset;16777213 [ 0 1 ... 16777211 16777212 ^ 16777213 16 +777214 ... 33554427 33554428 ] [ 16777213 16777214 ... 33554427 33554428 ^ 0 + 1 ... 16777211 16777212 ] iterative: swaps:33554429 took 0.519861854 secs. [ 0 1 ... 16777211 16777212 ^ 16777213 16 +777214 ... 33554427 33554428 ] [ 16777213 16777214 ... 33554427 33554428 ^ 0 + 1 ... 16777211 16777212 ] recursive: swaps:33554428 took 0.212679449 secs. [ 0 1 ... 16777211 16777212 ^ 16777213 16 +777214 ... 33554427 33554428 ] [ 16777213 16777214 ... 33554427 33554428 ^ 0 + 1 ... 16777211 16777212 ] reversive: swaps:33554428 took 0.275975182 secs. C:\test\C>bufswap 134217723 67108859 2 ### 2^26+prime prime size:134217723 offset;67108859 [ 0 1 ... 67108857 67108858 ^ 67108859 67 +108860 ... 134217721 134217722 ] [ 67108859 67108860 ... 134217721 134217722 ^ 0 + 1 ... 67108857 67108858 ] iterative: swaps:134217723 took 2.986443308 secs. [ 0 1 ... 67108857 67108858 ^ 67108859 67 +108860 ... 134217721 134217722 ] [ 67108859 67108860 ... 134217721 134217722 ^ 0 + 1 ... 67108857 67108858 ] recursive: swaps:134217722 took 0.957572077 secs. [ 0 1 ... 67108857 67108858 ^ 67108859 67 +108860 ... 134217721 134217722 ] [ 67108859 67108860 ... 134217721 134217722 ^ 0 + 1 ... 67108857 67108858 ] reversive: swaps:134217722 took 1.185514126 secs. C:\test\C>bufswap 268435417 134217689 2 ### 2^27+prime prime size:268435417 offset;134217689 [ 0 1 ... 134217687 134217688 ^ 134217689 134 +217690 ... 268435415 268435416 ] [ 134217689 134217690 ... 268435415 268435416 ^ 0 + 1 ... 134217687 134217688 ] iterative: swaps:268435417 took 11.702352647 secs. [ 0 1 ... 134217687 134217688 ^ 134217689 134 +217690 ... 268435415 268435416 ] [ 134217689 134217690 ... 268435415 268435416 ^ 0 + 1 ... 134217687 134217688 ] recursive: swaps:268435416 took 1.888340531 secs. [ 0 1 ... 134217687 134217688 ^ 134217689 134 +217690 ... 268435415 268435416 ] [ 134217689 134217690 ... 268435415 268435416 ^ 0 + 1 ... 134217687 134217688 ] reversive: swaps:268435416 took 2.337291168 secs. [download] And it is only when the buffer size gets up into the GB range that the shear space which the modulo arithmetic has to distributes the wraps in, ultimately defeats the combined L1/L2/L3 caches and their LRU algorithms and starts to slow things down. I set out to find an algorithm. I lucked out that I had some of the best minds -- which must include this particular anonymonk -- apparently looking for diversion on a Lazy Sunday, and got three excellent ones to choose from. And, being me, once they were all coded & running, the only way I was going to choose was with a benchmark :) BTW: the third algorithm, labeled "reversive" in the spoiler above, is the first one suggested, by bitingduck, that would obviously work. It does 3 successive reverses of the buffer (thus two full passes): `abcdef123 321fedcba reverse the entire buffer. 123fedcba reverse the smaller partition. 123abcdef reverse the larger partition.` [download] I originally dismissed this/saved it as a last resort, because (I thought) that it would do far more swaps than was necessary. In reality, it does exactly the same as the recursive solution (1 less) in every situation I've tested. And it is very clean to program: `U32 reverseBuffer( register U64 p, const U32 start, const U32 size ) +{ register U32 i, swaps = 0; const U32 half = size >> 1; for( i = start; i < half; ++i ) { swapElems( p, i, size -1 -i ); ++swaps; } return swaps; } U32 xchgBufferV( U64 p, U32 size, U32 offset ) { register U32 swaps = 0; swaps += reverseBuffer( p, 0, size ); swaps += reverseBuffer( p, 0, size - offset ); swaps += reverseBuffer( p + (size - offset), 0, offset ); return swaps; }` [download] The only saving graces from that summary dismissal is that a) bitingduck was himself somewhat dismissive of it; b) it led to a highly entertaining thread; far more analysis than I would ever have done otherwise; and the brilliant outcome of an algorithm that seems to perform near optimally under all the circumstances I've thrown at it. I hope other people were as entertained by it as I was. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]
Re^7: [OT] Swapping buffers in place. (Final summation.) by bitingduck (Deacon) on Mar 02, 2015 at 15:33 UTC
I've been quite entertained by the whole thread all weekend- it's been my diversion from wrestling with getting a new little computer to talk to various data acq devices using little C snippets to do the direct talking and wrapping Perl around it to make it faster to code up and modify. Fortunately I have most of it wired up and set for remote access, so I could work from the comfort of my couch much of the time. I'm remembering how much C I've forgotten... The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up-- even just pushing return addresses on the stack and no data it's going to get large for your datasets. Once you showed the manual swaps, I was sure iterative was going to be the answer. Most of what I thought the reversing algorithm had going for it was a) simple to code, b) sure to work without debugging, and c) you can probably fit it in about 20 bytes of code if you get sent back to 1983.	[reply]
Re^8: [OT] Swapping buffers in place. (Final summation.) by BrowserUk (Patriarch) on Mar 02, 2015 at 19:38 UTC
The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up-- even just pushing return addresses on the stack and no data it's going to get large for your datasets. I did try to find a pathological case for the recursive version. Using a simple sub it is easy to see the steps it goes through for a particular set of parameters: `[0] Perl> sub steps { my($i,$m,$o) = (0,@_); print( "$i: $m $o" ), ++$ +i,( $m >= $o ? ($m -= $o) : ($o -= $m )) while $m and $o; };; [0] Perl> steps( 9, 6 );; 0: 9 6 1: 3 6 2: 3 3 [0] Perl> steps( 17, 9 );; 0: 17 9 1: 8 9 2: 8 1 3: 7 1 4: 6 1 5: 5 1 6: 4 1 7: 3 1 8: 2 1 9: 1 1` [download] It pathological case is when there is a difference of just 1 element between the two buffers. The first step moves the smaller buffer into its final position in the one go; but then the odd byte has to be 'rippled' through the rest of the larger buffer to get it (and the rest of the larger buffer) into their final positions. So then I tried running it with 2^29 2^28-1 (but turn of the reporting and output just the final number of steps: `[0] Perl> sub steps { my($i,$m,$o) = (0,@_); ++$i,( $m >= $o ? ($m -= +$o) : ($o -= $m )) while $m and $o; print $i; };; Subroutine steps redefined at (eval 17) line 1, <STDIN> line 9. [0] Perl> steps( 229, 228-1 );; 134217731` [download] 134 million steps, with all but one moving 1 byte one place at a time. The prospects didn't look good. As you say, that'd involve 134 million 8-byte return addresses on the stack. Except it didn't. I saw no memory growth at all. Which could only mean that the compiler had tail-call optimised the recursion way. And sure enough, looking at the asm it has. It also eliminated the duplicated `y == size` comparison: p$ = 80 size$ = 88 x$ = 96 y$ = 104 swaps$ = 112 _xchgBuffersR PROC ; 53 : U32 _xchgBuffersR( register U64 p, const U32 size, register +U32 x, register U32 y, U32 register swaps ) { $LN19: mov QWORD PTR [rsp+16], rbx mov QWORD PTR [rsp+24], rbp mov QWORD PTR [rsp+32], rsi push rdi push r12 push r13 push r14 push r15 sub rsp, 32 ; 00000020H mov rax, rcx movsxd rbx, r9d movsxd r15, r8d and eax, 7 mov esi, edx mov r13, rcx mov r14, r15 mov rdi, rbx mov QWORD PTR tv246[rsp], rax npad 8 $LL12@xchgBuffer@2: ############## H +ERE* (see below) ; 54 : U32 register i; ; 55 : U32 const savedY = y; mov ebp, ebx mov r12, rdi ; 56 : ; 57 : assert( y <= size ); cmp ebx, esi jbe SHORT $LN8@xchgBuffer@2 lea rdx, OFFSET FLAT:$SG3784 lea rcx, OFFSET FLAT:$SG3785 mov r8d, 57 ; 00000039H call _wassert mov rax, QWORD PTR tv246[rsp] $LN8@xchgBuffer@2: ; 58 : assert( ( (U64)p & 7ull ) == 0 ); test rax, rax je SHORT $LN9@xchgBuffer@2 lea rdx, OFFSET FLAT:$SG3788 lea rcx, OFFSET FLAT:$SG3789 mov r8d, 58 ; 0000003aH call _wassert mov rax, QWORD PTR tv246[rsp] $LN9@xchgBuffer@2: ; 59 : ; 60 : if( y == size ) return swaps; cmp ebx, esi je SHORT $LN18@xchgBuffer@2 ############# recursio +n end condition jumps to THERE ; 61 : ; 62 : for( i = x; i < savedY; ++i ) { cmp r15d, ebx jae SHORT $LL12@xchgBuffer@2 mov eax, ebx sub eax, r15d mov edx, eax add DWORD PTR swaps$[rsp], edx add r15d, eax $LL4@xchgBuffer@2: ; 63 : if( y == size ) y = savedY; cmp ebx, esi jne SHORT $LN1@xchgBuffer@2 mov ebx, ebp mov rdi, r12 $LN1@xchgBuffer@2: ; 64 : swapElems( p, x++, y++ ); mov rax, QWORD PTR [r13+rdi8] mov rcx, QWORD PTR [r13+r148] inc rdi mov QWORD PTR [r13+r148], rax mov QWORD PTR [r13+rdi8-8], rcx inc r14 inc ebx sub rdx, 1 jne SHORT $LL4@xchgBuffer@2 ; 65 : ++swaps; ; 66 : } ; 67 : return _xchgBuffersR( p, size, x, y, swaps ); mov rax, QWORD PTR tv246[rsp] jmp $LL12@xchgBuffer@2 ############### and j +ust before it, it unconditionally jumps to HERE $LN18@xchgBuffer@2: ############### +THERE ; 59 : ; 60 : if( y == size ) return swaps; mov eax, DWORD PTR swaps$[rsp] ; 68 : } mov rbx, QWORD PTR [rsp+88] mov rbp, QWORD PTR [rsp+96] mov rsi, QWORD PTR [rsp+104] add rsp, 32 ; 00000020H pop r15 pop r14 pop r13 pop r12 pop rdi ret 0 _xchgBuffersR ENDP [download] So, when I fed those parameters into the real code: C:\test\C>bufswap 536870912 268435455 2 ### 2^29 2^28-1 size:536870912 offset;268435455 [ 0 1 ... 268435453 268435454 ^ 268435455 268 +435456 ... 536870910 536870911 ] [ 268435455 268435456 ... 536870910 536870911 ^ 0 + 1 ... 268435453 268435454 ] iterative: swaps:536870912 took 7.359985176 secs. [ 0 1 ... 268435453 268435454 ^ 268435455 268 +435456 ... 536870910 536870911 ] [ 268435455 268435456 ... 536870910 536870911 ^ 0 + 1 ... 268435453 268435454 ] recursive: swaps:536870911 took 3.762964774 secs. [ 0 1 ... 268435453 268435454 ^ 268435455 268 +435456 ... 536870910 536870911 ] [ 268435455 268435456 ... 536870910 536870911 ^ 0 + 1 ... 268435453 268435454 ] reversive: swaps:536870911 took 4.901475821 secs. [download] Nada! No pathological behaviour. That one-at-a-time ripple may look/sound laborious, but its basically a single run through memory, like copying a string, that the hardware and caches are designed to optimise for. Hence why I never bothered to test the iterative version of the algorithm that anonymonk posted above. The compiler made a better job of the conversion. (Besides, then I wouldn't have been able to call it the recursive algorithm; and I so like my 'iterative'/'recursive'/'reversive' labels :) Most of what I thought the reversing algorithm had going for it was a) simple to code, b) sure to work without debugging, and c) you can probably fit it in about 20 bytes of code if you get sent back to 1983. :) There is definitely something to be said for simple! With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l] [select]

Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries.

The pathological behaviour only exhibits with buffer/offset pairings specifically chosen to cause it.

The paramaters in the following run, 536,870,912/268,435,456 represent an exactly 2GB buffer with the partition half way along:

C:\test\C>bufswap 536870912 268435456 2   ### 2^29 2^28
size:536870912 offset;268435456
[         0          1 ...  268435454    268435455 ^  268435456    268
+435457   ...  536870910  536870911 ]
[ 268435456  268435457 ...  536870910    536870911 ^          0       
+     1   ...  268435454  268435455 ]
iterative: swaps:536870912 took 4.364767872 secs.

[         0          1 ...  268435454    268435455 ^  268435456    268
+435457   ...  536870910  536870911 ]
[ 268435456  268435457 ...  536870910    536870911 ^          0       
+     1   ...  268435454  268435455 ]
recursive: swaps:268435456 took 2.476348362 secs.
[download]

Note that although the iterative algorithm does double the swaps, (equal to the total number of U64 elements in the buffer ) of the recursive version, the time taken is substantially less than double (175% .v. 200%) than the time taken by the recursive version. So, there is nothing wrong with the efficiency of the implementation of the algorithm.

And in many (maybe most) buffer/offset ratios, they both do the same number of swaps:

C:\test\C>bufswap 30000000 20000000 2
size:30000000 offset;20000000
[         0          1 ...   19999998     19999999 ^   20000000     20
+000001   ...   29999998   29999999 ]
[  20000000   20000001 ...   29999998     29999999 ^          0       
+     1   ...   19999998   19999999 ]
iterative: swaps:30000000 took 0.235516729 secs.

[         0          1 ...   19999998     19999999 ^   20000000     20
+000001   ...   29999998   29999999 ]
[  20000000   20000001 ...   29999998     29999999 ^          0       
+     1   ...   19999998   19999999 ]
recursive: swaps:20000000 took 0.163142653 secs.
[download]

By anyone's standards, performing 1/4 billion swaps of two, 64-bit values, shifting 4GB of data around in the process, all in 4.36 seconds, ain't at all tardy!

Oh to be able to get the data off my discs fast enough to make use of that ~1GB/s throughput. The best I get out of my HD is about 200MB/s. My new SSD gets close, but only if the cache is warm.

Also note that the total memory above is only 57 x U64s (456 bytes) different to the pathological case, where the buffer/offset pairing is 536,870,855/268,435,399 (2GB+largest prime smaller/largest prime smaller):

C:\test\C>bufswap 536870855 268435399 2      ### 2^28+prime prime
size:536870855 offset;268435399
[         0          1 ...  268435397    268435398 ^  268435399    268
+435400   ...  536870853  536870854 ]
[ 268435399  268435400 ...  536870853    536870854 ^          0       
+     1   ...  268435397  268435398 ]
iterative: swaps:536870855 took 23.912365889 secs.

[         0          1 ...  268435397    268435398 ^  268435399    268
+435400   ...  536870853  536870854 ]
[ 268435399  268435400 ...  536870853    536870854 ^          0       
+     1   ...  268435397  268435398 ]
recursive: swaps:536870854 took 3.633964430 secs.
[download]

Note that the number of swaps are only 1 different, but the time is 658% longer. That threw me through a lot of hoops!

Even though I chose the numbers, (power of 2/largestest prime smaller; based on previous experience of a modulo arithmetic driven process), to exacerbate the scattering effect of the modulo arithmetic -- maximising the number of times the pointer has to wrap around -- it took me completely by surprise at the size of the difference it made. I went over and over and over the code looking for some cock-up before the reality of cache misses hit me.

But I also 'lucked out'. I've since tried various other power-of-two/largest prime smaller combinations, and none of them come close to triggering the same kind of differences:

C:\test\C>bufswap 2097147 1048571 2          ### 2^20+prime prime
size:2097147 offset;1048571
[         0          1 ...    1048569      1048570 ^    1048571      1
+048572   ...    2097145    2097146 ]
[   1048571    1048572 ...    2097145      2097146 ^          0       
+     1   ...    1048569    1048570 ]
iterative: swaps:2097147 took 0.050917112 secs.

[         0          1 ...    1048569      1048570 ^    1048571      1
+048572   ...    2097145    2097146 ]
[   1048571    1048572 ...    2097145      2097146 ^          0       
+     1   ...    1048569    1048570 ]
recursive: swaps:2097146 took 0.022471381 secs.

[         0          1 ...    1048569      1048570 ^    1048571      1
+048572   ...    2097145    2097146 ]
[   1048571    1048572 ...    2097145      2097146 ^          0       
+     1   ...    1048569    1048570 ]
reversive: swaps:2097146 took 0.030067538 secs.


C:\test\C>bufswap 33554429 16777213 2       ### 2^24+prime prime
size:33554429 offset;16777213
[         0          1 ...   16777211     16777212 ^   16777213     16
+777214   ...   33554427   33554428 ]
[  16777213   16777214 ...   33554427     33554428 ^          0       
+     1   ...   16777211   16777212 ]
iterative: swaps:33554429 took 0.519861854 secs.

[         0          1 ...   16777211     16777212 ^   16777213     16
+777214   ...   33554427   33554428 ]
[  16777213   16777214 ...   33554427     33554428 ^          0       
+     1   ...   16777211   16777212 ]
recursive: swaps:33554428 took 0.212679449 secs.

[         0          1 ...   16777211     16777212 ^   16777213     16
+777214   ...   33554427   33554428 ]
[  16777213   16777214 ...   33554427     33554428 ^          0       
+     1   ...   16777211   16777212 ]
reversive: swaps:33554428 took 0.275975182 secs.

C:\test\C>bufswap 134217723 67108859 2    ### 2^26+prime prime
size:134217723 offset;67108859
[         0          1 ...   67108857     67108858 ^   67108859     67
+108860   ...  134217721  134217722 ]
[  67108859   67108860 ...  134217721    134217722 ^          0       
+     1   ...   67108857   67108858 ]
iterative: swaps:134217723 took 2.986443308 secs.

[         0          1 ...   67108857     67108858 ^   67108859     67
+108860   ...  134217721  134217722 ]
[  67108859   67108860 ...  134217721    134217722 ^          0       
+     1   ...   67108857   67108858 ]
recursive: swaps:134217722 took 0.957572077 secs.

[         0          1 ...   67108857     67108858 ^   67108859     67
+108860   ...  134217721  134217722 ]
[  67108859   67108860 ...  134217721    134217722 ^          0       
+     1   ...   67108857   67108858 ]
reversive: swaps:134217722 took 1.185514126 secs.

C:\test\C>bufswap 268435417 134217689 2  ### 2^27+prime prime
size:268435417 offset;134217689
[         0          1 ...  134217687    134217688 ^  134217689    134
+217690   ...  268435415  268435416 ]
[ 134217689  134217690 ...  268435415    268435416 ^          0       
+     1   ...  134217687  134217688 ]
iterative: swaps:268435417 took 11.702352647 secs.

[         0          1 ...  134217687    134217688 ^  134217689    134
+217690   ...  268435415  268435416 ]
[ 134217689  134217690 ...  268435415    268435416 ^          0       
+     1   ...  134217687  134217688 ]
recursive: swaps:268435416 took 1.888340531 secs.

[         0          1 ...  134217687    134217688 ^  134217689    134
+217690   ...  268435415  268435416 ]
[ 134217689  134217690 ...  268435415    268435416 ^          0       
+     1   ...  134217687  134217688 ]
reversive: swaps:268435416 took 2.337291168 secs.
[download]

And it is only when the buffer size gets up into the GB range that the shear space which the modulo arithmetic has to distributes the wraps in, ultimately defeats the combined L1/L2/L3 caches and their LRU algorithms and starts to slow things down.

I set out to find an algorithm. I lucked out that I had some of the best minds -- which must include this particular anonymonk -- apparently looking for diversion on a Lazy Sunday, and got three excellent ones to choose from. And, being me, once they were all coded & running, the only way I was going to choose was with a benchmark :)

BTW: the third algorithm, labeled "reversive" in the spoiler above, is the first one suggested, by bitingduck, that would obviously work. It does 3 successive reverses of the buffer (thus two full passes):

abcdef123
321fedcba reverse the entire buffer.
123fedcba reverse the smaller partition.
123abcdef reverse the larger partition.
[download]

I originally dismissed this/saved it as a last resort, because (I thought) that it would do far more swaps than was necessary. In reality, it does exactly the same as the recursive solution (1 less) in every situation I've tested. And it is very clean to program:

U32 reverseBuffer( register U64 *p, const U32 start, const U32 size ) 
+{
    register U32 i, swaps = 0;
    const U32 half = size >> 1;

    for( i = start; i < half; ++i ) {
        swapElems( p, i, size -1 -i );
        ++swaps;
    }
    return swaps;
}

U32 xchgBufferV( U64 *p, U32 size, U32 offset ) {
    register U32 swaps = 0;
    swaps += reverseBuffer( p, 0, size );
    swaps += reverseBuffer( p, 0, size - offset );
    swaps += reverseBuffer( p + (size - offset), 0, offset );
    return swaps;
}
[download]

The only saving graces from that summary dismissal is that a) bitingduck was himself somewhat dismissive of it; b) it led to a highly entertaining thread; far more analysis than I would ever have done otherwise; and the brilliant outcome of an algorithm that seems to perform near optimally under all the circumstances I've thrown at it.

I hope other people were as entertained by it as I was.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

[reply]
[d/l]
[select]

I've been quite entertained by the whole thread all weekend- it's been my diversion from wrestling with getting a new little computer to talk to various data acq devices using little C snippets to do the direct talking and wrapping Perl around it to make it faster to code up and modify. Fortunately I have most of it wired up and set for remote access, so I could work from the comfort of my couch much of the time. I'm remembering how much C I've forgotten...

The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up-- even just pushing return addresses on the stack and no data it's going to get large for your datasets. Once you showed the manual swaps, I was sure iterative was going to be the answer.

Most of what I thought the reversing algorithm had going for it was a) simple to code, b) sure to work without debugging, and c) you can probably fit it in about 20 bytes of code if you get sent back to 1983.

[reply]

The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up-- even just pushing return addresses on the stack and no data it's going to get large for your datasets.

I did try to find a pathological case for the recursive version. Using a simple sub it is easy to see the steps it goes through for a particular set of parameters:

[0] Perl> sub steps { my($i,$m,$o) = (0,@_); print( "$i: $m $o" ), ++$
+i,( $m >= $o ? ($m -= $o) : ($o -= $m )) while $m and $o; };;

[0] Perl> steps( 9, 6 );;
0: 9 6
1: 3 6
2: 3 3

[0] Perl> steps( 17, 9 );;
0: 17 9
1: 8 9
2: 8 1
3: 7 1
4: 6 1
5: 5 1
6: 4 1
7: 3 1
8: 2 1
9: 1 1
[download]

It pathological case is when there is a difference of just 1 element between the two buffers. The first step moves the smaller buffer into its final position in the one go; but then the odd byte has to be 'rippled' through the rest of the larger buffer to get it (and the rest of the larger buffer) into their final positions.

So then I tried running it with 2^29 2^28-1 (but turn of the reporting and output just the final number of steps:

[0] Perl> sub steps { my($i,$m,$o) = (0,@_); ++$i,( $m >= $o ? ($m -= 
+$o) : ($o -= $m )) while $m and $o; print $i; };;
Subroutine steps redefined at (eval 17) line 1, <STDIN> line 9.

[0] Perl> steps( 2**29, 2**28-1 );;
134217731
[download]

134 million steps, with all but one moving 1 byte one place at a time. The prospects didn't look good. As you say, that'd involve 134 million 8-byte return addresses on the stack. Except it didn't. I saw no memory growth at all. Which could only mean that the compiler had tail-call optimised the recursion way. And sure enough, looking at the asm it has. It also eliminated the duplicated y == size comparison:

p$ = 80
size$ = 88
x$ = 96
y$ = 104
swaps$ = 112
_xchgBuffersR PROC

; 53   : U32 _xchgBuffersR( register U64 *p, const U32 size, register 
+U32 x, register U32 y, U32 register swaps ) {

$LN19:
    mov    QWORD PTR [rsp+16], rbx
    mov    QWORD PTR [rsp+24], rbp
    mov    QWORD PTR [rsp+32], rsi
    push    rdi
    push    r12
    push    r13
    push    r14
    push    r15
    sub    rsp, 32                    ; 00000020H
    mov    rax, rcx
    movsxd    rbx, r9d
    movsxd    r15, r8d
    and    eax, 7
    mov    esi, edx
    mov    r13, rcx
    mov    r14, r15
    mov    rdi, rbx
    mov    QWORD PTR tv246[rsp], rax
    npad    8
$LL12@xchgBuffer@2:                                 ############## **H
+ERE** (see below)

; 54   :     U32 register i;
; 55   :     U32 const savedY = y;

    mov    ebp, ebx
    mov    r12, rdi

; 56   : 
; 57   :     assert( y <= size );

    cmp    ebx, esi
    jbe    SHORT $LN8@xchgBuffer@2
    lea    rdx, OFFSET FLAT:$SG3784
    lea    rcx, OFFSET FLAT:$SG3785
    mov    r8d, 57                    ; 00000039H
    call    _wassert
    mov    rax, QWORD PTR tv246[rsp]
$LN8@xchgBuffer@2:

; 58   :     assert( ( (U64)p & 7ull ) == 0 );

    test    rax, rax
    je    SHORT $LN9@xchgBuffer@2
    lea    rdx, OFFSET FLAT:$SG3788
    lea    rcx, OFFSET FLAT:$SG3789
    mov    r8d, 58                    ; 0000003aH
    call    _wassert
    mov    rax, QWORD PTR tv246[rsp]
$LN9@xchgBuffer@2:

; 59   : 
; 60   :     if( y == size ) return swaps;

    cmp    ebx, esi
    je    SHORT $LN18@xchgBuffer@2              ############# recursio
+n end condition jumps to **THERE**

; 61   : 
; 62   :     for( i = x; i < savedY; ++i ) {

    cmp    r15d, ebx
    jae    SHORT $LL12@xchgBuffer@2
    mov    eax, ebx
    sub    eax, r15d
    mov    edx, eax
    add    DWORD PTR swaps$[rsp], edx
    add    r15d, eax
$LL4@xchgBuffer@2:

; 63   :         if( y == size ) y = savedY;

    cmp    ebx, esi
    jne    SHORT $LN1@xchgBuffer@2
    mov    ebx, ebp
    mov    rdi, r12
$LN1@xchgBuffer@2:

; 64   :         swapElems( p, x++, y++ );

    mov    rax, QWORD PTR [r13+rdi*8]
    mov    rcx, QWORD PTR [r13+r14*8]
    inc    rdi
    mov    QWORD PTR [r13+r14*8], rax
    mov    QWORD PTR [r13+rdi*8-8], rcx
    inc    r14
    inc    ebx
    sub    rdx, 1
    jne    SHORT $LL4@xchgBuffer@2

; 65   :         ++swaps;
; 66   :     }
; 67   :     return _xchgBuffersR( p, size, x, y, swaps );

    mov    rax, QWORD PTR tv246[rsp]
    jmp    $LL12@xchgBuffer@2                    ############### and j
+ust before it, it unconditionally jumps to **HERE**
$LN18@xchgBuffer@2:                                   ############### 
+**THERE**

; 59   : 
; 60   :     if( y == size ) return swaps;   

    mov    eax, DWORD PTR swaps$[rsp]

; 68   : }

    mov    rbx, QWORD PTR [rsp+88]
    mov    rbp, QWORD PTR [rsp+96]
    mov    rsi, QWORD PTR [rsp+104]
    add    rsp, 32                    ; 00000020H
    pop    r15
    pop    r14
    pop    r13
    pop    r12
    pop    rdi
    ret    0
_xchgBuffersR ENDP
[download]

So, when I fed those parameters into the real code:

C:\test\C>bufswap 536870912 268435455 2  ### 2^29 2^28-1
size:536870912 offset;268435455
[         0          1 ...  268435453    268435454 ^  268435455    268
+435456   ...  536870910  536870911 ]
[ 268435455  268435456 ...  536870910    536870911 ^          0       
+     1   ...  268435453  268435454 ]
iterative: swaps:536870912 took 7.359985176 secs.

[         0          1 ...  268435453    268435454 ^  268435455    268
+435456   ...  536870910  536870911 ]
[ 268435455  268435456 ...  536870910    536870911 ^          0       
+     1   ...  268435453  268435454 ]
recursive: swaps:536870911 took 3.762964774 secs.

[         0          1 ...  268435453    268435454 ^  268435455    268
+435456   ...  536870910  536870911 ]
[ 268435455  268435456 ...  536870910    536870911 ^          0       
+     1   ...  268435453  268435454 ]
reversive: swaps:536870911 took 4.901475821 secs.
[download]

Nada! No pathological behaviour. That one-at-a-time ripple may look/sound laborious, but its basically a single run through memory, like copying a string, that the hardware and caches are designed to optimise for. Hence why I never bothered to test the iterative version of the algorithm that anonymonk posted above. The compiler made a better job of the conversion. (Besides, then I wouldn't have been able to call it the recursive algorithm; and I so like my 'iterative'/'recursive'/'reversive' labels :)