in reply to Re^4: [OT] Swapping buffers in place.
in thread [OT] Swapping buffers in place.
Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries. I didn't check the assembler, either, but I'd expect the extra conditionals mine does would also jam things up.
Anonymonk's version looks pretty nifty, so I'll have to go over it to be sure I know what it's doing. ;^)
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re^6: [OT] Swapping buffers in place. (Final summation.)
by BrowserUk (Patriarch) on Mar 02, 2015 at 14:15 UTC
"Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries."
The pathological behaviour only shows up with buffer/offset pairings specifically chosen to cause it. The parameters in the following run, 536,870,912/268,435,456, represent an exactly 2GB buffer with the partition halfway along:
Note that although the iterative algorithm does double the swaps of the recursive version (equal to the total number of U64 elements in the buffer), the time taken is substantially less than double (175% vs. 200%) that of the recursive version. So, there is nothing wrong with the efficiency of the implementation of the algorithm. And for many (maybe most) buffer/offset ratios, they both do the same number of swaps:
By anyone's standards, performing 1/4 billion swaps of two 64-bit values, shifting 4GB of data around in the process, all in 4.36 seconds, ain't at all tardy! Oh to be able to get the data off my discs fast enough to make use of that ~1GB/s throughput. The best I get out of my HD is about 200MB/s. My new SSD gets close, but only if the cache is warm.
Also note that the total memory above is only 57 x U64s (456 bytes) different to the pathological case, where the buffer/offset pairing is 536,870,855/268,435,399 (2GB+largest prime smaller/largest prime smaller):
Note that the number of swaps differs by only 1, but the time is 658% longer. That threw me through a lot of hoops! Even though I chose the numbers (power of 2/largest prime smaller; based on previous experience of a modulo-arithmetic-driven process) to exacerbate the scattering effect of the modulo arithmetic -- maximising the number of times the pointer has to wrap around -- the size of the difference it made took me completely by surprise. I went over and over and over the code looking for some cock-up before the reality of cache misses hit me.
But I also 'lucked out'. I've since tried various other power-of-two/largest-prime-smaller combinations, and none of them come close to triggering the same kind of differences:
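The scattering effect is easiest to see in the textbook cycle-following ("juggling") rotation. The sketch below is that classic technique, not the iterative code benchmarked in this thread; the names `gcd_size` and `swap_buffers_cyclic` are made up for illustration. Each element is pulled from `offset` places further on, wrapping at the end of the buffer, so when size and offset share no common factor a single cycle strides across the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor; equals the number of independent cycles. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: turn buf = [A | B] into [B | A] in place,
 * where A is the first `offset` elements of `size` in total. */
static void swap_buffers_cyclic(uint64_t *buf, size_t size, size_t offset)
{
    assert(offset <= size);
    if (offset == 0 || offset == size)
        return;

    size_t cycles = gcd_size(size, offset);
    for (size_t start = 0; start < cycles; ++start) {
        uint64_t saved = buf[start];   /* first element of this cycle */
        size_t dst = start;
        for (;;) {
            size_t src = dst + offset; /* element that belongs at dst */
            if (src >= size)
                src -= size;           /* wrap around the end of the buffer */
            if (src == start)
                break;                 /* cycle closed */
            buf[dst] = buf[src];
            dst = src;
        }
        buf[dst] = saved;
    }
}
```

With a halfway partition of a power-of-two buffer this degenerates into size/2 two-element cycles that walk memory in order. With a coprime pairing there is one long cycle, and consecutive accesses land `offset` elements apart.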
And it is only when the buffer size gets up into the GB range that the sheer space over which the modulo arithmetic has to distribute the wraps ultimately defeats the combined L1/L2/L3 caches and their LRU algorithms and starts to slow things down.
I set out to find an algorithm. I lucked out that I had some of the best minds -- which must include this particular anonymonk -- apparently looking for diversion on a Lazy Sunday, and got three excellent ones to choose from. And, being me, once they were all coded & running, the only way I was going to choose was with a benchmark :)
BTW: the third algorithm, labeled "reversive" in the spoiler above, is the first one suggested, by bitingduck, that would obviously work. It does 3 successive reverses of the buffer (thus two full passes):
I originally dismissed this/saved it as a last resort, because I thought it would do far more swaps than were necessary. In reality, it does the same number of swaps as the recursive solution, less one, in every situation I've tested. And it is very clean to program:
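The original listing is not reproduced here. A minimal C sketch of the three-reversal idea, assuming a buffer of `uint64_t` elements with the partition at `offset`, could look like this; the names `reverse_u64` and `swap_buffers_reversive` are illustrative, not the code that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse n elements in place. */
static void reverse_u64(uint64_t *p, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = 0, j = n - 1; i < j; ++i, --j) {
        uint64_t t = p[i];
        p[i] = p[j];
        p[j] = t;
    }
}

/* Hypothetical helper: turn buf = [A | B] into [B | A] in place,
 * where A is the first `offset` elements of `size` in total. */
static void swap_buffers_reversive(uint64_t *buf, size_t size, size_t offset)
{
    assert(offset <= size);
    reverse_u64(buf, size);                    /* [A | B] -> [rev(B) | rev(A)] */
    reverse_u64(buf, size - offset);           /* rev(B)  -> B                 */
    reverse_u64(buf + (size - offset), offset); /* rev(A)  -> A                 */
}
```

The first reverse is one full pass over the buffer and the other two together make up the second, and every access is sequential.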
The only saving graces from that summary dismissal are that a) bitingduck was himself somewhat dismissive of it; b) it led to a highly entertaining thread, with far more analysis than I would ever have done otherwise, and the brilliant outcome of an algorithm that seems to perform near optimally under all the circumstances I've thrown at it. I hope other people were as entertained by it as I was.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
by bitingduck (Deacon) on Mar 02, 2015 at 15:33 UTC
I've been quite entertained by the whole thread all weekend; it's been my diversion from wrestling with getting a new little computer to talk to various data acq devices, using little C snippets to do the direct talking and wrapping Perl around them to make it faster to code up and modify. Fortunately I have most of it wired up and set for remote access, so I could work from the comfort of my couch much of the time. I'm remembering how much C I've forgotten...
The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up -- even just pushing return addresses on the stack and no data, it's going to get large for your datasets. Once you showed the manual swaps, I was sure iterative was going to be the answer.
Most of what I thought the reversing algorithm had going for it was a) simple to code, b) sure to work without debugging, and c) you can probably fit it in about 20 bytes of code if you get sent back to 1983.
by BrowserUk (Patriarch) on Mar 02, 2015 at 19:38 UTC
"The algorithm that surprises me the most is the recursive, in that you didn't end up with a lot of stack overhead slowing it down too much or hanging it up -- even just pushing return addresses on the stack and no data, it's going to get large for your datasets."
I did try to find a pathological case for the recursive version. Using a simple sub it is easy to see the steps it goes through for a particular set of parameters:
Its pathological case is when there is a difference of just 1 element between the two buffers. The first step moves the smaller buffer into its final position in one go; but then the odd byte has to be 'rippled' through the rest of the larger buffer to get it (and the rest of the larger buffer) into their final positions. So then I tried running it with 2^29 2^28-1 (but with the reporting turned off, outputting just the final number of steps):
134 million steps, with all but one moving 1 byte one place at a time. The prospects didn't look good. As you say, that'd involve 134 million 8-byte return addresses on the stack.
Except it didn't. I saw no memory growth at all. Which could only mean that the compiler had tail-call optimised the recursion away. And sure enough, looking at the asm, it has. It also eliminated the duplicated y == size comparison:
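The listings from this post are not reproduced here. Purely as an illustration, this is a minimal C sketch of a block-swap recursion with the shape described: each step swaps the smaller region into its final place and recurses on what is left, always as the last thing it does. The names `swap_blocks` and `swap_buffers_recursive`, the `steps` accumulator, and the choice of which blocks get exchanged are assumptions, not the code that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exchange n elements at p with n elements at q (regions must not overlap). */
static void swap_blocks(uint64_t *p, uint64_t *q, size_t n)
{
    while (n--) {
        uint64_t t = *p;
        *p++ = *q;
        *q++ = t;
    }
}

/* Hypothetical helper: turn [A | B] at p into [B | A], where A has x
 * elements and B has y. `steps` is an accumulator so the recursive call
 * stays in tail position; the return value is the number of block swaps. */
static size_t swap_buffers_recursive(uint64_t *p, size_t x, size_t y,
                                     size_t steps)
{
    if (x == 0 || y == 0)
        return steps;
    if (x <= y) {
        /* Swap A with the last x elements of B: A lands in its final place. */
        swap_blocks(p, p + y, x);
        return swap_buffers_recursive(p, x, y - x, steps + 1);
    }
    /* Swap the first y elements of A with B: B lands in its final place. */
    swap_blocks(p, p + x, y);
    return swap_buffers_recursive(p + y, x - y, y, steps + 1);
}
```

Called as `swap_buffers_recursive(buf, offset, size - offset, 0)`. Whether the tail calls are actually turned into a loop depends on the compiler and optimisation level; without that, a near-equal split would recurse very deeply.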
So, when I fed those parameters into the real code:
Nada! No pathological behaviour. That one-at-a-time ripple may look/sound laborious, but it's basically a single run through memory, like copying a string, that the hardware and caches are designed to optimise for.
Hence why I never bothered to test the iterative version of the algorithm that anonymonk posted above. The compiler made a better job of the conversion. (Besides, then I wouldn't have been able to call it the recursive algorithm; and I so like my 'iterative'/'recursive'/'reversive' labels :)
"Most of what I thought the reversing algorithm had going for it was a) simple to code, b) sure to work without debugging, and c) you can probably fit it in about 20 bytes of code if you get sent back to 1983."
:) There is definitely something to be said for simple!