in reply to [OT] A measure of 'sortedness'?

Pardon the YX kind of question, but... why isn't bucket sort (on the whole dataset) applicable in your situation?

Re^2: [OT] A measure of 'sortedness'?
by BrowserUk (Patriarch) on Mar 19, 2015 at 21:01 UTC
    why isn't bucket sort (on the whole dataset) applicable in your situation?

    1. How do you divide a (minimum) input range of 2^64 into buckets, when a typical 100GB dataset represents just 0.00000007275957614183% of that range? (See the arithmetic after this list.)
    2. Because it is O(N^2) worst case.

      But no indication of what provokes that worst case.

    3. Because it requires O(N+K) space worst case.

      (With no indication that I can see of what K is.)

    4. Because I haven't seen anything about a bucket sort that makes it a candidate for the problem?

      (A radix sort has possibilities; but I haven't gotten around to pursuing them yet.)
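
    For anyone checking that percentage, the arithmetic -- assuming 8-byte keys -- is simply:

      100GB = 107,374,182,400 bytes; / 8 bytes per record = 13,421,772,800 records
      13,421,772,800 / 2^64 ≈ 7.276e-10, i.e. ~0.00000007276% of the key space.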


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      BrowserUk:

      This has been an interesting thread. I've been busy at work, so haven't really had a chance to think it over terribly much. I was kicking around the idea of radix sorting, so since you mentioned that it may have possibilities, I thought I'd pass along my experience with them. Simply put, I really liked the result. It was very fast.

      I just dug it out (hard to believe it was that long ago) and anonymized it. I may have over-anonymized it, so let me know if I broke it, so I can make suitable repairs. To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies. Then I wrote the list in the indicated order.
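
      That code isn't reproduced here, but the shape of the approach is easy to sketch. The following is just a minimal illustration in C (made-up record layout and names, not the anonymized program): each record carries the extra link field, and an LSD radix sort relinks the list one key-byte per pass, so the record bodies never move.

          #include <stdint.h>
          #include <stddef.h>

          /* Hypothetical record: a 64-bit key, an opaque fixed-length payload,
             and the extra link field used only for sorting. */
          typedef struct rec {
              uint64_t    key;
              /* ... fixed-length payload ... */
              struct rec *next;
          } rec;

          /* One stable pass per key byte, least significant first.  Each pass
             distributes the nodes into 256 bucket lists, then re-chains the
             buckets in order; after the 8th pass the list is fully sorted. */
          rec *radix_sort_list(rec *head)
          {
              for (int pass = 0; pass < 8; ++pass) {
                  rec *bhead[256] = { 0 };   /* bucket heads */
                  rec *btail[256] = { 0 };   /* bucket tails */

                  for (rec *p = head; p != NULL; ) {
                      rec *nxt = p->next;
                      unsigned b = (unsigned)((p->key >> (8 * pass)) & 0xFF);
                      p->next = NULL;
                      if (btail[b]) btail[b]->next = p; else bhead[b] = p;
                      btail[b] = p;
                      p = nxt;
                  }

                  /* Concatenate the 256 buckets in ascending order. */
                  head = NULL;
                  rec *tail = NULL;
                  for (int b = 0; b < 256; ++b) {
                      if (!bhead[b]) continue;
                      if (tail) tail->next = bhead[b]; else head = bhead[b];
                      tail = btail[b];
                  }
              }
              return head;
          }

      Writing the records out in the final link order then produces the sorted file, with the bodies copied only on that final write, never during the sort itself.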

      I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key. If you used that knowledge while creating the file (for example), you could build a file that was *nearly* sorted, and then use a sliding window to sort the file.
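
      As a concrete sketch of that guess (assuming uniformly distributed 64-bit keys and a record count known up front), the expected final slot is just the key scaled into 0..N-1:

          #include <stdint.h>

          /* Hypothetical helper: rough final slot for a key, assuming keys are
             uniform over the full 64-bit range and nrecs is known in advance.
             The gcc/clang 128-bit multiply avoids overflow in key * nrecs. */
          static uint64_t guess_slot(uint64_t key, uint64_t nrecs)
          {
              return (uint64_t)(((unsigned __int128)key * nrecs) >> 64);
          }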

      Since you're thinking of huge buffers and swaps, I thought you could use the information like so: Read a block "A" and do your preprocessing pass. As part of that pass, you could compute, for each record, the block and location of its best-guess final position. You could then use those guesses to tell you which block may be best to pair "A" against to get records into their final locations in as few swaps as possible.

      I don't know how involved you are in creating the original file, but if you have control over it, you may be able to take advantage there as well.

      Maybe it'll trigger a useful thought...

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies.

        With typically 8-byte or 16-byte records, and enough of them to require > 32-bit indexes, I'd need 64-bit pointers to build even a single linked list, which would greatly exacerbate the memory problem (with 8-byte records, the links alone would roughly double the footprint). And in the end, as I'm working with memory-mapped files, the records would have to move around anyway.

        I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key.

        As I alluded to in my reply to the anonymonk's query about bucket sort: the range of the values is so huge that even 100GB files cover only a tiny percentage of that range; and the distribution is unknown...

        If you used that knowledge while creating the file ... I don't know how involved you are in creating the original file.

        This isn't just one file, or one type of file, but any of a number of different types of file. My sort utility is intended to be generic for any kind of fixed-record-length file. There is thus no way to quantify the distribution without a full pass over the file being sorted; and it varies greatly for different filetypes.

        For example, in one type, the numbers represent huge file offsets (indexing much, much larger files), thus the numbers will be unique and concentrated at the low end of the range -- lots of leading zero bytes.

        In another type, the 64-bit numbers are actually two interleaved 32-bit numbers (ie. Morton numbers) that give 2D coordinates a 1D spatial ordering; in that case they can be clustered in small, tight clumps but distributed over the full range. In other types the keys are short strings with associated offsets into huge, variable-length-record files. Or numbers and offsets. Etc.
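
        (For reference, the usual bit-spreading trick for interleaving two 32-bit coordinates into one 64-bit Morton number looks roughly like this -- a generic sketch, not code from the sort utility:)

            #include <stdint.h>

            /* Spread the 32 bits of v so they occupy the even bit positions of a
               64-bit word (the classic magic-number interleave). */
            static uint64_t part1by1(uint32_t v)
            {
                uint64_t x = v;
                x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
                x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
                x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
                x = (x | (x << 2))  & 0x3333333333333333ULL;
                x = (x | (x << 1))  & 0x5555555555555555ULL;
                return x;
            }

            /* Bit i of x lands at position 2i, bit i of y at 2i+1, giving the
               1D Z-order (Morton) key for the 2D point (x, y). */
            static uint64_t morton64(uint32_t x, uint32_t y)
            {
                return part1by1(x) | (part1by1(y) << 1);
            }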

        And that kind of indicates where I'm at with regard to radix sorts. Sorting into 256 buckets, byte by byte: for many of the file types the first three bytes would be useless, putting all the records in the first bucket. For others, the first byte will be restricted to just 26 values, which for a 100GB file is perfect -- ~4GB in each populated bucket; but these things only ever get larger, and 250+GB is already a reality. Then you are talking ~10GB per bucket, which gets impractical. For my machine at least.

        Thinking about it, I guess you could start the bucket/radix sort using the least significant byte rather than the most; but then you need to merge again from all the buckets simultaneously.

        The only workable alternative to the "what range" problem I can think of requires a full pre-scan of the entire file. Which sounds awful, but may actually be more practical than I previously thought.
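
        Sketching what such a pre-scan might look like (hypothetical names; a plain key array stands in for one sequential pass over the memory-mapped file): build a histogram over the top 16 bits of each key, then pick split points from the cumulative counts so every bucket holds roughly the same number of records, whatever the distribution turns out to be.

            #include <stdint.h>
            #include <stddef.h>

            #define HBITS 16                  /* histogram resolution: top 16 key bits */
            #define HBINS (1u << HBITS)

            /* Pre-scan: count how many keys fall into each top-16-bit bin.
               hist must have HBINS entries and be zero-initialized. */
            static void key_histogram(const uint64_t *keys, size_t n, uint64_t *hist)
            {
                for (size_t i = 0; i < n; ++i)
                    hist[keys[i] >> (64 - HBITS)]++;
            }

            /* Choose nbuckets-1 split points so each bucket gets roughly
               n/nbuckets records; splits[b] is the first bin of bucket b+1. */
            static void choose_splits(const uint64_t *hist, size_t n,
                                      unsigned nbuckets, uint32_t *splits)
            {
                uint64_t target = (uint64_t)n / nbuckets, run = 0;
                unsigned b = 0;
                for (uint32_t bin = 0; bin < HBINS && b + 1 < nbuckets; ++bin) {
                    run += hist[bin];
                    if (run >= (uint64_t)(b + 1) * target)
                        splits[b++] = bin + 1;
                }
                while (b + 1 < nbuckets)      /* pad if the counts run out early */
                    splits[b++] = HBINS;
            }

        A second pass could then route each record to its bucket with a binary search over splits[], and each bucket -- now bounded in size regardless of how skewed the keys are -- could be sorted independently.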

