in reply to Re^2: [OT] Swapping buffers in place.
in thread [OT] Swapping buffers in place.

Well the basic idea is this: suppose we have an array like
1 2 3 4 5 6 a b c
initially, x points to 1, y points to a:
x           y
1 2 3 4 5 6 a b c
We want to move a b c all the way to the left, so we swap numbers and letters:
      x                 y
a b c 4 5 6 1 2 3
Now the rightmost elements are scrambled. Note that y at this stage points one past the right end of the array. We restore its position:

$y = $offset_y if $y == @$ary;

      x     y
a b c 4 5 6 1 2 3
Now we basically have a smaller array 4 5 6 1 2 3 and we want to swap 4 5 6 and 1 2 3. Hence, recursion. But it's a tail call, aka 'while loop in disguise', so:
sub do_swaps {
    my ( $ary, $offset_x, $offset_y, $swaps ) = @_;
    my ( $x, $y ) = ( $offset_x, $offset_y );

    while ( $offset_y != @$ary ) {
        for ( $offset_x .. ( $offset_y - 1 ) ) {
            $y = $offset_y if $y == @$ary;
            my $temp = $ary->[$x];
            $ary->[$x] = $ary->[$y];
            $ary->[$y] = $temp;
            $x += 1;
            $y += 1;
            $swaps += 1;
        }
        ( $offset_x, $offset_y ) = ( $x, $y );
    }
    return $swaps;
}
Same thing... I think :)

Replies are listed 'Best First'.
Re^4: [OT] Swapping buffers in place.
by BrowserUk (Patriarch) on Mar 02, 2015 at 02:12 UTC

    Actually, it turns out it's not (just) the number of swaps made.

    Even when the iterative version makes essentially the same number of swaps (actually 1 more, on a count of 536 million+), I can induce it to produce pathological timings (23.6 seconds versus 3.6 seconds):

    C:\test\C>bufswap 536870855 268435399
    size:536870855 offset;268435399
    [ 0 1 2 3 4 5 6 7 8 9 ...268435389 268435390 268435391 268435392 2684353 ...
    [268435399 268435400 268435401 268435402 268435403 268435404 268435405 268435406 268435407 268435408 ...
    iterative: swaps:536870855 took 23.625009728 secs.
    [ 0 1 2 3 4 5 6 7 8 9 ...268435389 268435390 268435391 268435392 2684353 ...
    [268435399 268435400 268435401 268435402 268435403 268435404 268435405 268435406 268435407 268435408 ...
    recursive: swaps:536870854 took 3.580721198 secs.

    Looking back at the trace output I generated from the perl versions, the only thing I can put the difference down to is cache misses.

    Choosing a left buffer of 268,435,456 U64 elements (a 2GB allocation) and a right buffer of 268,435,399 U64s (the largest prime smaller) induces the iterative version's modulo arithmetic to hit elements effectively randomly throughout the buffer, which causes massive cache misses.

    The recursive version tends to work through the array sequentially, minimising cache misses and demonstrating the importance of doing so very nicely.

    Took me a while to figure out the reason, but I've never seen a better demonstration of the importance of locality of reference.

    Which now is going to make me go back and look at bitingduck's triple-reverse mechanism.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      BrowserUk:

      Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries. I didn't check the assembler, either, but I'd bet the extra conditionals mine does would also jam things up.

      Anonymonk's version looks pretty nifty, so I'll have to go over it to be sure I know what it's doing. ;^)

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        Yeah, I don't expect it would be a cache-friendly algorithm. I didn't expect you were swapping buffers that large, or I'd've probably gone with something that would do blocks at a time, rather than entries.

        The pathological behaviour only exhibits with buffer/offset pairings specifically chosen to cause it.

        The parameters in the following run, 536,870,912/268,435,456, represent an exactly 4GB buffer of U64s with the partition halfway along:

        C:\test\C>bufswap 536870912 268435456 2    ### 2^29 2^28
        size:536870912 offset;268435456
        [ 0 1 ... 268435454 268435455 ^ 268435456 268435457 ... 536870910 536870911 ]
        [ 268435456 268435457 ... 536870910 536870911 ^ 0 1 ... 268435454 268435455 ]
        iterative: swaps:536870912 took 4.364767872 secs.
        [ 0 1 ... 268435454 268435455 ^ 268435456 268435457 ... 536870910 536870911 ]
        [ 268435456 268435457 ... 536870910 536870911 ^ 0 1 ... 268435454 268435455 ]
        recursive: swaps:268435456 took 2.476348362 secs.

        Note that although the iterative algorithm does double the swaps of the recursive version (equal to the total number of U64 elements in the buffer), the time taken is substantially less than double (175% v. 200%) that of the recursive version. So there is nothing wrong with the efficiency of the implementation of the algorithm.

        And for many (maybe most) buffer/offset ratios, they both do the same number of swaps.

        By anyone's standards, performing 1/4 billion swaps of two 64-bit values, shifting 4GB of data around in the process, all in 4.36 seconds, ain't at all tardy!

        Oh to be able to get the data off my discs fast enough to make use of that ~1GB/s throughput. The best I get out of my HD is about 200MB/s. My new SSD gets close, but only if the cache is warm.

        Also note that the total memory above differs by only 57 U64s (456 bytes) from the pathological case, where the buffer/offset pairing is 536,870,855/268,435,399 (2GB + largest prime smaller / largest prime smaller):

        C:\test\C>bufswap 536870855 268435399 2    ### 2^28+prime prime
        size:536870855 offset;268435399
        [ 0 1 ... 268435397 268435398 ^ 268435399 268435400 ... 536870853 536870854 ]
        [ 268435399 268435400 ... 536870853 536870854 ^ 0 1 ... 268435397 268435398 ]
        iterative: swaps:536870855 took 23.912365889 secs.
        [ 0 1 ... 268435397 268435398 ^ 268435399 268435400 ... 536870853 536870854 ]
        [ 268435399 268435400 ... 536870853 536870854 ^ 0 1 ... 268435397 268435398 ]
        recursive: swaps:536870854 took 3.633964430 secs.

        Note that the swap counts differ by only 1, yet the iterative version takes 658% of the recursive version's time. That put me through a lot of hoops!

        Even though I chose the numbers (power of 2 / largest prime smaller, based on previous experience of modulo-arithmetic-driven processes) to exacerbate the scattering effect of the modulo arithmetic, maximising the number of times the pointer has to wrap around, the size of the difference it made took me completely by surprise. I went over and over and over the code looking for some cock-up before the reality of cache misses hit me.

        But I also 'lucked out'. I've since tried various other power-of-two/largest-prime-smaller combinations, and none of them comes close to triggering the same kind of difference.

        And it is only when the buffer size gets up into the GB range that the sheer size of the space the modulo arithmetic has to distribute the wraps over finally defeats the combined L1/L2/L3 caches and their LRU algorithms and starts to slow things down.

        I set out to find an algorithm. I lucked out that I had some of the best minds -- which must include this particular anonymonk -- apparently looking for diversion on a Lazy Sunday, and got three excellent ones to choose from. And, being me, once they were all coded & running, the only way I was going to choose was with a benchmark :)


        BTW: the third algorithm, labeled "reversive" in the spoiler above, is the first one suggested, by bitingduck, that would obviously work. It does 3 successive reverses of the buffer (thus two full passes):

        abcdef123
        321fedcba    reverse the entire buffer.
        123fedcba    reverse the smaller partition.
        123abcdef    reverse the larger partition.

        I originally dismissed this / saved it as a last resort because I thought it would do far more swaps than necessary. In reality, it does exactly the same number as the recursive solution (actually 1 fewer) in every situation I've tested. And it is very clean to program:

        U32 reverseBuffer( register U64 *p, const U32 start, const U32 size ) {
            register U32 i, swaps = 0;
            const U32 half = size >> 1;
            for( i = start; i < half; ++i ) {
                swapElems( p, i, size -1 -i );
                ++swaps;
            }
            return swaps;
        }

        U32 xchgBufferV( U64 *p, U32 size, U32 offset ) {
            register U32 swaps = 0;
            swaps += reverseBuffer( p, 0, size );
            swaps += reverseBuffer( p, 0, size - offset );
            swaps += reverseBuffer( p + (size - offset), 0, offset );
            return swaps;
        }

        The only saving graces from that summary dismissal are that: a) bitingduck was himself somewhat dismissive of it; and b) it led to a highly entertaining thread, far more analysis than I would ever have done otherwise, and the brilliant outcome of an algorithm that seems to perform near optimally under all the circumstances I've thrown at it.

        I hope other people were as entertained by it as I was.


Re^4: [OT] Swapping buffers in place.
by BrowserUk (Patriarch) on Mar 02, 2015 at 00:57 UTC

    And we have a (provisional) winner! (Needs more testing.)

    Your (my C version of) tail recursive algorithm beats (my C version of) roboticus' implementation of his/graff's/hdb's/bitingduck's iterative algorithm on a 3GB buffer with a 2GB offset.

    The output shows 10 elements at the beginning, either side of the offset, and at the end, before and after the swap:

    C:\test\C>bufswap 402653184 268435456
    size:402653184 offset;268435456
    [ 0 1 2 3 4 5 6 7 8 9 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...402653174 402653175 402653176 402653177 402653178 402653179 402653180 402653181 402653182 402653183 ]
    [268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...134217718 134217719 134217720 134217721 134217722 134217723 134217724 134217725 134217726 134217727 134217728 134217729 134217730 134217731 134217732 134217733 134217734 134217735 134217736 134217737 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 ]
    iterative took 3.217107510
    [ 0 1 2 3 4 5 6 7 8 9 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...402653174 402653175 402653176 402653177 402653178 402653179 402653180 402653181 402653182 402653183 ]
    [268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...134217718 134217719 134217720 134217721 134217722 134217723 134217724 134217725 134217726 134217727 134217728 134217729 134217730 134217731 134217732 134217733 134217734 134217735 134217736 134217737 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 ]
    recursive took 2.445034532

    The C code (needs some sanity checks):

    /* Drawn from roboticus' (1118303) & hdb's (1118256) iterative implementations
       (Algorithm also described by graff & bitingduck) */
    void xlateBuffersI( register U64 *p, const U32 size, const U32 offset ) {
        U32 register cnt = size, dst = 0, cmp = 0;
        U64 register tmp = p[ dst ];

        assert( offset < size );
        assert( ( (U64)p & 7ull ) == 0 );

        while( cnt-- ) {
            U32 register src = ( dst + offset ) % size;
            if( src == cmp ) {
                p[ dst ] = tmp;
                dst = ++cmp;
                tmp = p[ dst ];
            } else {
                p[ dst ] = p[ src ];
                dst = src;
            }
        }
        return;
    }

    __forceinline void swapElems( register U64 *p, const U32 x, const U32 y ) {
        register U64 temp = p[ x ];
        p[ x ] = p[ y ];
        p[ y ] = temp;
    }

    /* Adapted from anonymonk's post 1118262 */
    void xlateBuffersR( U64 *p, U32 size, U32 x, U32 y ) {
        U32 i, savedY = y;
        if( y == size ) return;
        for( i = x; i < savedY; ++i ) {
            if( y == size ) y = savedY;
            swapElems( p, x++, y++ );
        }
        xlateBuffersR( p, size, x, y );
    }

      Glad to help, BUK. Make sure you test it well, cause I'm not at all confident there aren't bugs in my code ;) But the idea is sound, IMO.
Re^4: [OT] Swapping buffers in place.
by Anonymous Monk on Mar 02, 2015 at 00:06 UTC
    'Busier' example:
    x   y
    a b 1 2 3 4 5 6 7

        x   y
    1 2 a b 3 4 5 6 7    -- recurse

            x   y
    1 2 3 4 a b 5 6 7    -- recurse

                x   y
    1 2 3 4 5 6 a b 7

                  x   y
    1 2 3 4 5 6 7 b a    -- first swap on this recursion level

                  x y
    1 2 3 4 5 6 7 b a    -- restore y's position

                  x y
    1 2 3 4 5 6 7 a b    -- second swap and recurse

    $y == @$ary now, so stop.
    Something like that...
                    x y
      1 2 3 4 5 6 7 a b    -- second swap and recurse

      off by one :) should be, of course:

                        x y
      1 2 3 4 5 6 7 a b    -- second swap and recurse
Re^4: [OT] Swapping buffers in place.
by BrowserUk (Patriarch) on Mar 02, 2015 at 01:19 UTC

    I added back your swap count code and added some to the iterative version, and it's easy to see the reason yours is faster:

    C:\test\C>bufswap 402653184 268435456
    size:402653184 offset;268435456
    [ 0 1 2 3 4 5 6 7 8 9 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...402653174 402653175 402653176 402653177 402653178 402653179 402653180 402653181 402653182 402653183 ]
    [268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...134217718 134217719 134217720 134217721 134217722 134217723 134217724 134217725 134217726 134217727 134217728 134217729 134217730 134217731 134217732 134217733 134217734 134217735 134217736 134217737 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 ]
    iterative: swaps:402653184 took 3.218781038 secs.
    [ 0 1 2 3 4 5 6 7 8 9 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...402653174 402653175 402653176 402653177 402653178 402653179 402653180 402653181 402653182 402653183 ]
    [268435456 268435457 268435458 268435459 268435460 268435461 268435462 268435463 268435464 268435465 ...134217718 134217719 134217720 134217721 134217722 134217723 134217724 134217725 134217726 134217727 134217728 134217729 134217730 134217731 134217732 134217733 134217734 134217735 134217736 134217737 ...268435446 268435447 268435448 268435449 268435450 268435451 268435452 268435453 268435454 268435455 ]
    recursive: swaps:268435456 took 2.436202970 secs.

Re^4: [OT] Swapping buffers in place.
by Anonymous Monk on Mar 02, 2015 at 01:57 UTC
    Now we basically have a smaller array 4 5 6 1 2 3 and we want to swap 4 5 6 and 1 2 3. Hence, recursion.
    (actually, in this particular example we still have 3 iterations left in the for loop. The function recurses if the size of the left part of the array is less than or equal to that of the right part)
      (damn, I must be tired. disregard that part about recursion :)