in reply to [OT] A measure of 'sortedness'?

Does the sequential pass of the buffers have to be done first?

Given that the merge sort runs through each list fully in the process of merging, could you not make the results of the scan a side effect of the merge?
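Something like this (a minimal sketch in Python, rather than anything tuned for fixed-length records) is what I have in mind: a plain two-pointer merge that also counts cross-run inversions as it goes, so the 'sortedness' number falls out as a side effect. Zero means the two runs were already in order end to end.

def merge_counting_inversions(a, b):
    # Standard two-pointer merge of two sorted runs that also counts how
    # many times an element of b overtakes elements still waiting in a.
    out = []
    inversions = 0
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
            inversions += len(a) - i   # b[j] jumped ahead of all remaining a's
    out.extend(a[i:])
    out.extend(b[j:])
    return out, inversions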

I suppose you might be able to determine an upper limit during your sequential scans, to see how much buffer space would be required to do the merge without swapping.
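By way of illustration (my own sketch, and the helper name is made up): for two adjacent sorted runs, the leading elements of the first run that are no bigger than the second run's first element, and the trailing elements of the second run that are no smaller than the first run's last element, are already in their final slots. Only the middle sections interleave, and a scratch buffer the size of the smaller middle section is enough to merge without any swapping. Two binary searches, or the tail end of the sequential scan, give the bound:

import bisect

def scratch_bound(a, b):
    # Upper bound on the temp space needed to merge adjacent sorted runs
    # a and b (a's region immediately precedes b's in the buffer).
    if not a or not b or a[-1] <= b[0]:
        return 0                          # already in order end to end
    lo = bisect.bisect_right(a, b[0])     # a[:lo] is already in place
    hi = bisect.bisect_left(b, a[-1])     # b[hi:] is already in place
    return min(len(a) - lo, hi)           # copy the smaller middle aside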

On the other hand, even if you have to do some swapping in order to keep the buffer to a fixed size, you still get a linear worst case. Starting with list A laid out before list B, if all the B's sort to the front, then you'll have to swap each A into the vacated B positions before moving it once more to its final position:

--AAAAABBBBB   # B's sort before A's, - is empty buffer
B-AAAAA-BBBB
BBAAAAA--BBB
BB-AAAAa-BBB   # A moved to a to make room
BBBAAAAa--BB
BBB-AAAaa-BB
BBBBAAAaa--B
BBBB-AAaaa-B
BBBBBAAaaa--
BBBBB-Aaaaa-
BBBBBaA-aaa-   # Each A ends up moving twice; O(1.5N), linear.
BBBBBa-aaaa-
BBBBBaa-aaa-
BBBBBaaa-aa-
BBBBBaaaa-a-
BBBBBaaaaa--
# Plus 1.0N if you want to shift the set
# back to the original address range:
--BBBBBaaaaa
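To put a number on that shuffling, here is a small move-counting model (my sketch; its parking rule is simply 'leftmost free slot', so the exact count differs a little from the hand trace above, but it stays linear): the smallest unplaced element is always moved to the output cursor, and whatever unplaced element is blocking the cursor is first parked in a free slot.

def gap_merge_move_count(a, b, gap=2):
    # Buffer layout: `gap` free slots, then sorted run a, then sorted run b.
    # Repeatedly finalise the smallest unplaced element at the output cursor,
    # parking whatever unplaced element is blocking the cursor in a free slot.
    # Needs gap >= 1 so a free slot always exists for parking.
    # Returns (sorted contents, number of element moves).
    buf = [None] * gap + list(a) + list(b)
    out = moves = 0
    for _ in range(len(a) + len(b)):
        # Position of the smallest element not yet finalised.
        p = min((i for i in range(out, len(buf)) if buf[i] is not None),
                key=lambda i: buf[i])
        if p != out:
            if buf[out] is not None:
                free = next(i for i in range(out + 1, len(buf)) if buf[i] is None)
                buf[free], buf[out] = buf[out], None   # park the blocker
                moves += 1
            buf[out], buf[p] = buf[p], None            # place the smallest
            moves += 1
        out += 1
    return buf[:out], moves

The move count is bounded by twice the element count, since each placement costs at most one parking move plus one final move; for the ten-element worst case above it comes out at 16, i.e. 1.6N, not quadratic.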

Perhaps the more important question lies in the nature of the items: how much work does it take to determine their value/sort order, and how much work does it take to move them?

If they are complex objects with an expensive sort-key function, you could make the initial pass pre-calculate it. If the values of sequential items are correlated, you could possibly save time and/or brainpower by doing it up front.
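For instance (a minimal sketch in Python; key_func is a hypothetical stand-in for whatever expensive value extraction the records need), pay the key cost once per record during the initial pass and merge on the cached keys, decorate-merge-undecorate style:

from heapq import merge

def keyed_merge(run_a, run_b, key_func):
    # Decorate each record with its pre-computed key during the initial pass,
    # so the expensive key_func runs exactly once per record.
    dec_a = [(key_func(r), i, r) for i, r in enumerate(run_a)]
    dec_b = [(key_func(r), i, r) for i, r in enumerate(run_b, start=len(run_a))]
    # Tuples compare on the cached key first; the index breaks ties so the
    # records themselves are never compared.
    return [r for _, _, r in merge(dec_a, dec_b)]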

If the moves are particularly expensive, then a pre-calculated measure of sortedness could provide an improved time-to-completion estimate, by guessing how many items will need to be moved twice to do the merge in place.
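A rough sketch of such an estimate (my own back-of-envelope arithmetic, not a hard bound): assume every record gets written once to its final slot, and every record of the first run that actually interleaves with the second gets parked once on the way, as in the diagram above.

import bisect

def estimated_moves(a, b):
    # Rough time-to-completion heuristic for an in-place merge of adjacent
    # sorted runs a and b, for costing when individual moves are expensive.
    if not a or not b or a[-1] <= b[0]:
        return 0                                  # nothing needs to move
    interleaving_a = len(a) - bisect.bisect_right(a, b[0])
    return len(a) + len(b) + interleaving_a       # ~two moves per displaced a

For the all-B's-first example above this comes to 10 + 5 = 15 moves, the 1.5N of the hand trace, before the optional 1.0N shift back.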

PseudoEdit:
sauoc's idea sounds like a way to cut down the number of random disk accesses by doing bunches of objects at a time. I got the impression your data was all in memory, however.

Re^2: [OT] A measure of 'sortedness'?
by BrowserUk (Patriarch) on Mar 18, 2015 at 23:02 UTC
    Does the sequential pass of the buffers have to be done first?

    Yes.

    Perhaps the more important question is in the nature of the items.

    I think I mentioned "call them numbers". The point is that they are fixed-length records, so even if there are multiple numbers, or a string, the comparison function will be very simple; not much mileage there.

    I got the impression your data was all in memory however.

    Indeed. Whilst the data originates on disk and is bigger than memory, the two buffers being merged here are both fully in memory; but combined they are close to the limits of memory, hence there is not enough space to perform the n-way merge.


      Whilst the data originates on disk and is bigger than memory, the two buffers being merged here are both fully in memory; but combined they are close to the limits of memory, hence there is not enough space to perform the n-way merge.

      If the overall pair of data sets is bigger than memory, thus requiring that the work be done in chunks that fit in memory, why not smaller chunks?

        why not smaller chunks?

        Because eventually you need to merge the smaller chunks into bigger ones. That includes the ones bigger than memory; but by splitting into half-memory-sized chunks, you can merge them in pairs:

        A B C D   Each 2GB
        A&B
          B&C
            C&D   The largest have migrated from A to D, no need to revisit
          B&C
        A&B       The smallest have migrated from D to A no need to revisit
          B&C     And final pass ensures everything is in place.

        Using smaller buffers only delays the inevitable and increases the number of passes.
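        As a sketch of the scheme (Python, with the chunks as in-memory lists, which of course sidesteps the disk traffic that makes this interesting in practice): merge adjacent chunks in up-and-down passes, splitting each merged result back into the original chunk sizes so only two chunks ever need to be live at once, and stop once a full pass changes nothing.

        from heapq import merge

        def pairwise_chunk_sort(chunks):
            # chunks: non-empty, individually sorted lists, sized so that any
            # two of them fit in memory together.  Sorts the whole sequence
            # by repeated adjacent pair merges, preserving chunk sizes.
            def merge_pair(i, j):
                left, right = chunks[i], chunks[j]
                if left[-1] <= right[0]:
                    return False                         # already in order, skip it
                combined = list(merge(left, right))      # the only step needing both
                chunks[i] = combined[:len(left)]
                chunks[j] = combined[len(left):]
                return True

            changed = True
            while changed:
                changed = False
                for i in range(len(chunks) - 1):         # up pass: largest drift right
                    changed |= merge_pair(i, i + 1)
                for i in range(len(chunks) - 3, -1, -1): # down pass: smallest drift left
                    changed |= merge_pair(i, i + 1)
            return chunks

        With four equal chunks, one up pass pushes the largest records into D, the following down pass pulls the smallest into A, and the loop stops as soon as a complete pass finds every adjacent pair already in order.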

