BrowserUk:

This has been an interesting thread. I've been busy at work, so haven't really had a chance to think it over terribly much. I was kicking around the idea of radix sorting, so since you mentioned that it may have possibilities, I thought I'd pass along my experience with them. Simply put, I really liked the result. It was very fast.

I just dug it out (hard to believe it was that long ago) and anonymized it. I may have over-anonymized it, so let me know if I broke it, so I can make suitable repairs. To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies. Then I wrote the list in the indicated order.

$ cat RadixSort.cpp /*-------------------------------------------------------------------- +------- * 20031117 - originally written *-------------------------------------------------------------------- +------- */ #include <assert.h> #include <io.h> #pragma pack(push,1) struct TxnRec { TxnRec *next; char key[8]; // The rest of the record... }; TxnRec * root; TxnRec * last; // Read the records, chaining them together void read_transactions(const char *FName) { // Read and format the record TxnRec *p = get_transaction(IFH); root = last = p; while (p = get_transaction(IFH)) { last->next = p; last = p; } } void sort_records() { assert(root); // heads & tails of our buckets static TxnRec *multiroot[256]; static TxnRec *multitail[256]; // Radix sort from rightmost byte to leftmost for (int i=7; i>=0; --i) { // Clear traces of any prior sort(s)/pass(es) memset(multiroot, 0, sizeof(multiroot)); memset(multitail, 0, sizeof(multitail)); // Break the chain into 256 chains based on the key position TxnRec *p = root; while (p) { // Keep a handle on the next record to process TxnRec *next = p->next; // Detach it from the chain p->next=0; unsigned char c = p->key[i]; if (!multiroot[c]) { // First item in bucket multiroot[idx] = p; multitail[idx] = p; } else { // Add it to the end multitail[idx]->next = p; multitail[idx] = p; } p = next; } // And then stitch them together end-to-end root = 0; TxnRec *tail = 0; for (int j=0; j<256; j++) { if (multiroot[j]) { if (!root) { root = multiroot[j]; tail = multitail[j]; } else { tail->next = multiroot[j]; tail = multitail[j]; } } } } }

I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key. If you used that knowledge while creating the file (for example), you could build a file that was *nearly* sorted, and then use a sliding window to sort the file.

Since you're thinking of huge buffers and swaps, I thought you could use the information like so: Read a block "A" and do your preprocessing pass. As part of that process, you could compute the block and location of the best guess for the record. You could then use that guess to tell you which block may be best to pair "A" against to get records into their final locations in as few swaps as possible.

I don't know how involved you are in creating the original file, but if you have control over it, you may be able to possibly take advantage there as well.

Maybe it'll trigger a useful thought...

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re^3: [OT] A measure of 'sortedness'? by roboticus
in thread [OT] A measure of 'sortedness'? by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.