Actually, it turns out it's not (just) the number of swaps made.
Even when the iterative version makes essentially the same number of swaps (actually one more, out of 536 million+), I can induce it to produce pathological timings: 23.6 seconds versus 3.6 seconds:
C:\test\C>bufswap 536870855 268435399
size:536870855 offset:268435399
[ 0 1 2 3 4 5 6 7 8 9 ... 268435389 268435390 268435391 268435392 2684353...
[ 268435399 268435400 268435401 268435402 268435403 268435404 268435405 268435406 268435407 268435408 ...
iterative: swaps:536870855 took 23.625009728 secs.
[ 0 1 2 3 4 5 6 7 8 9 ... 268435389 268435390 268435391 268435392 2684353...
[ 268435399 268435400 268435401 268435402 268435403 268435404 268435405 268435406 268435407 268435408 ...
recursive: swaps:536870854 took 3.580721198 secs.
Looking back at the trace output I generated from the Perl versions, the only thing I can put the difference down to is cache misses.
By choosing a left buffer of 268,435,456 U64 elements (a 2GB allocation) and a right buffer of 268,435,399 U64s (the largest prime less than that), the iterative version's modulo arithmetic is induced to hit elements effectively at random throughout the buffer, which causes massive cache misses.
The recursive version tends to work through the array sequentially, minimising the cache misses and demonstrating the importance of doing so very nicely.
Took me a while to figure out the reason, but I've never seen a better demonstration of the importance of locality of reference.
Which now is going to make me go back and look at bitingduck's triple-reverse mechanism.
In reply to Re^4: [OT] Swapping buffers in place. by BrowserUk
in thread [OT] Swapping buffers in place. by BrowserUk