in reply to [OT] A measure of 'sortedness'?

Given:

@b1 = (1,1,1,4,4); @b2 = (1,2,3,4,6);

and desiring:

@b1 = (1,1,1,1,2); @b2 = (3,4,4,4,6);

Why not use a sort algorithm to sort the virtual array consisting of @b1 + @b2, using a function that maps the indices?

sub ind { my $i = shift; if ( $i <= $#b1 ) { return \$b1[$i] } else { return \$b2[ $i - $#b1 - 1 ] } }

This doesn't make any use of those free passes thru each buffer. But it also doesn't need an external buffer.
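
For what it's worth, a minimal, self-contained sketch of the idea (my illustration only; ind() is restated with a my added, and insertion sort is chosen purely because it is short): any in-place comparison sort can be pointed at the virtual array simply by reading and writing through the references that ind() returns.

use strict;
use warnings;

our @b1 = ( 1, 1, 1, 4, 4 );
our @b2 = ( 1, 2, 3, 4, 6 );

sub ind {                                   # map a virtual index to an element ref
    my $i = shift;
    return $i <= $#b1 ? \$b1[$i] : \$b2[ $i - $#b1 - 1 ];
}

my $last = $#b1 + $#b2 + 1;                 # last index of the virtual array
for my $i ( 1 .. $last ) {                  # in-place insertion sort over it
    my $j = $i;
    while( $j > 0 && ${ ind( $j - 1 ) } > ${ ind( $j ) } ) {
        my $t              = ${ ind( $j ) };    # swap through the refs
        ${ ind( $j ) }     = ${ ind( $j - 1 ) };
        ${ ind( $j - 1 ) } = $t;
        --$j;
    }
}

print "@b1 | @b2\n";                        # 1 1 1 1 2 | 3 4 4 4 6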

Dum Spiro Spero

Replies are listed 'Best First'.
Re^2: [OT] A measure of 'sortedness'?
by BrowserUk (Patriarch) on Mar 19, 2015 at 20:38 UTC
    Why not use a sort algorithm?

    The best sorts are at best O(N log N), but they are not adaptive to partially sorted data.

    I tried it using the CRT qsort(). The first pass sorts 200 equal partitions of the array, which is actually quicker than sorting the entire array in one go.

    The second pass sorts the buffers together in pairs, moving left to right and back again in a shaker sort process until done:

    C:\test\C>mergeParts 200000000 200
    qsort took 59.047691648 secs for 200 partitions of 200000000 element array
    qsort took 7121.598593905 secs to merge 200 partitions of 200000000 element array

    The killer here is the shaker sort required to bring all the bits together, which requires N*(N-1)/2 pairwise merges:

    C:\test>BiDiBubbleSort -N=4 #### 4*3/2 = 6 pairs
    [ D<>C B A ]: comp&swap( 0, 1 ) D < C
    [ C D<>B A ]: comp&swap( 1, 2 ) D < B
    [ C B D<>A ]: comp&swap( 2, 3 ) D < A
    [ C B<>A D ]: comp&swap( 1, 2 ) B < A
    [ C<>A B D ]: comp&swap( 0, 1 ) C < A
    [ A C<>B D ]: comp&swap( 1, 2 ) C < B
    A B C D

    C:\test>BiDiBubbleSort -N=5 #### 5*4/2 = 10 pairs.
    [ E<>D C B A ]: comp&swap( 0, 1 ) E < D
    [ D E<>C B A ]: comp&swap( 1, 2 ) E < C
    [ D C E<>B A ]: comp&swap( 2, 3 ) E < B
    [ D C B E<>A ]: comp&swap( 3, 4 ) E < A
    [ D C B<>A E ]: comp&swap( 2, 3 ) B < A
    [ D C<>A B E ]: comp&swap( 1, 2 ) C < A
    [ D<>A C B E ]: comp&swap( 0, 1 ) D < A
    [ A D<>C B E ]: comp&swap( 1, 2 ) D < C
    [ A C D<>B E ]: comp&swap( 2, 3 ) D < B
    [ A C<>B D E ]: comp&swap( 1, 2 ) C < B
    A B C D E

    For 200 partitions, that's 200*199/2 = 19,900 O(N log N) sorts required to merge the partitions.
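
    For concreteness, a Perl sketch of that second pass (my own reconstruction from the description above, not the C program that produced the timings): adjacent partitions are sorted together as a pair, sweeping left to right and back again, with the window shrinking by one partition after each directional sweep, mirroring the trace above.

    use strict;
    use warnings;

    sub merge_pair {                       # sort two adjacent partitions as one block
        my( $parts, $i ) = @_;
        my @joined = sort { $a <=> $b } @{ $parts->[$i] }, @{ $parts->[$i+1] };
        my $n = @{ $parts->[$i] };
        @{ $parts->[$i]   } = @joined[ 0 .. $n - 1 ];
        @{ $parts->[$i+1] } = @joined[ $n .. $#joined ];
    }

    sub shaker_merge {                     # bidirectional 'shaker' pass over partitions
        my( $parts ) = @_;
        my( $lo, $hi ) = ( 0, $#$parts - 1 );   # $hi indexes the last adjacent pair
        while( $lo <= $hi ) {
            merge_pair( $parts, $_ ) for $lo .. $hi;           # left-to-right sweep
            --$hi;                         # last partition now holds the global maxima
            last if $lo > $hi;
            merge_pair( $parts, $_ ) for reverse $lo .. $hi;   # right-to-left sweep
            ++$lo;                         # first partition now holds the global minima
        }
    }

    # e.g. my @p = ( [4,9], [1,7], [2,3] ); shaker_merge( \@p );
    #      @p is now [1,2], [3,4], [7,9]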

    The only sorts that are adaptive to partially sorted data are the O(N²) ones like insertion sort and shell sort; which makes them look great for the purpose when the number of elements is 30 or 50 items, but means that, even with their adaptive properties, they aren't great for merging 2 runs of a quarter of a billion elements ( 2 * 2GB / 8-byte ). Better than qsort(), but still not good when run 20,000 times.

    An N-way merge is O(N), but requires an unbounded output buffer, i.e. disk; and that is 3000 times slower than (cached) memory. If I can merge in place whilst avoiding moving every item at 3000 times the cost, I should save time. An O(N) in-place merge is possible, but the algorithms I've found described are complex, messy, badly described and a bitch to program.
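
    For reference, the N-way merge being compared against looks something like this (illustration only; a simple smallest-head scan rather than a heap): every element is moved exactly once, but only into a separate output buffer -- which is exactly the part that has to live on disk at this scale.

    use strict;
    use warnings;

    sub nway_merge {
        my @runs = @_;                      # each run is a sorted array ref
        my @pos  = ( 0 ) x @runs;           # read cursor per run
        my @out;                            # the unavoidable output buffer
        while( 1 ) {
            my $min;                        # run with the smallest unread head
            for my $r ( 0 .. $#runs ) {
                next if $pos[$r] > $#{ $runs[$r] };
                $min = $r if !defined $min
                          || $runs[$r][ $pos[$r] ] < $runs[$min][ $pos[$min] ];
            }
            last unless defined $min;       # every run exhausted
            push @out, $runs[$min][ $pos[$min]++ ];
        }
        return \@out;
    }

    # e.g. nway_merge( [1,4,9], [2,3,10], [5,6,7] ) returns [1,2,3,4,5,6,7,9,10]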

    I have "discovered" another algorithm; still fairly complex, but two-stage, mutually tail-recursive, with both parts being relatively clean in the small.

    The first part deals with two equal-length inputs and requires a (partial) linear scan of both buffers to split each into 2 parts. Of those 4 parts, 2 (1 in each of the original partitions) require either no further processing or a simple, in-place, equal-length swap. (Which also, by the by, completes the partial scans.)

    The other two parts are now recursed upon. Classic divide and conquer strategy.
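
    To make the first stage concrete, here is a hedged Perl sketch of how I read that description (my reconstruction, not BrowserUk's code): two adjacent, equal-length sorted runs in one array are split by a partial linear scan and an equal-length block swap, after which the two halves are completely independent sub-problems. The leftover unequal-length sub-merges are not the second stage described above; an ordinary buffered merge stands in for it purely so the sketch runs end to end.

    use strict;
    use warnings;

    sub merge_equal_runs {
        my( $a, $lo, $mid, $hi ) = @_;   # sorted runs [$lo..$mid-1], [$mid..$hi]
        my $n = $mid - $lo;              # both runs assumed to have length $n

        # Partial linear scan: count mirrored pairs ( tail of left run vs head
        # of right run ) that are out of order.
        my $t = 0;
        ++$t while $t < $n && $a->[ $mid - 1 - $t ] > $a->[ $mid + $t ];

        # Simple, in-place, equal-length swap of the two out-of-order blocks.
        for my $k ( 0 .. $t - 1 ) {
            @$a[ $mid - $t + $k, $mid + $k ] = @$a[ $mid + $k, $mid - $t + $k ];
        }

        # Everything in [$lo..$mid-1] is now <= everything in [$mid..$hi], so the
        # two halves can be finished independently, each holding two sorted runs
        # of unequal length. BrowserUk's second stage does this part in place;
        # a plain buffered merge stands in here.
        if( $t && $t < $n ) {
            _buffered_merge( $a, $lo,  $mid - $t - 1, $mid - 1 );
            _buffered_merge( $a, $mid, $mid + $t - 1, $hi      );
        }
    }

    sub _buffered_merge {                # ordinary two-run merge via a temp copy
        my( $a, $lo, $mid, $hi ) = @_;   # sorted runs [$lo..$mid], [$mid+1..$hi]
        my( $i, $j, @tmp ) = ( $lo, $mid + 1 );
        while( $i <= $mid && $j <= $hi ) {
            push @tmp, $a->[ $a->[$i] <= $a->[$j] ? $i++ : $j++ ];
        }
        push @tmp, @$a[ $i .. $mid ], @$a[ $j .. $hi ];
        @$a[ $lo .. $hi ] = @tmp;
    }

    # e.g. my @a = ( 1, 4, 6, 9, 2, 3, 8, 10 );
    #      merge_equal_runs( \@a, 0, 4, 7 );    # @a is now 1 2 3 4 6 8 9 10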

    Aside: qsort's greatest weakness -- the O(N²) reverse-sorted input -- can be trivially dealt with by a single initial pass over the data:

    for( p1 = start, p2 = end; p1 < p2; ++p1, --p2 ) if( a[p1] > a[p2] ) swap( p1, p2 );

    If the buffer was reverse sorted, it is now sorted. If you count the swaps and swaps == int(items/2), you're done. O(N) for qsort's nightmare. If the input was already sorted, no swaps and you're also done. Also O(N). Anywhere in between, and you've done no harm. The worst case now becomes: biggest, third biggest, fifth biggest; ...; sixth biggest, fourth biggest, second biggest, which isn't helped by the pre-sort; but neither is it harmed; and it is also far, far less likely than reverse sorted. This is a similar approach to pre-shuffling to avoid the problems, except it benefits hugely in two common cases. The "discovery" of the pre-scan aided (avoided) the processing.
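
    A small Perl rendering of that pre-pass (illustration only): mirror elements are compared from both ends inward and swapped when out of order, counting the swaps. A reverse-sorted input comes out fully sorted after int(items/2) swaps; an already sorted input is left untouched with zero swaps.

    use strict;
    use warnings;

    sub pre_pass {
        my( $a ) = @_;
        my $swaps = 0;
        my( $p1, $p2 ) = ( 0, $#$a );
        while( $p1 < $p2 ) {
            if( $a->[$p1] > $a->[$p2] ) {
                @$a[ $p1, $p2 ] = @$a[ $p2, $p1 ];
                ++$swaps;
            }
            ++$p1; --$p2;
        }
        return $swaps;                   # int(@$a/2) swaps for reverse-sorted input
    }

    # e.g. my @a = ( 5, 4, 3, 2, 1 ); pre_pass( \@a );   # @a is now sorted; 2 swaps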

    And that's where this thread arose: as I'm doing a sometimes partial, sometimes full scan in the first stage, can I gather any information during it that can beneficially inform the second stage? So far the answer seems to be no; but it triggered a couple of ideas that were worth investigating.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked