in reply to Re^3: [OT] A measure of 'sortedness'?
in thread [OT] A measure of 'sortedness'?
To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies.
With typically 8-byte or 16-byte records, and enough of them to require > 32-bit indexes, I'd need 64-bit pointers to build even a single linked list, which would greatly exacerbate the memory problem. And in the end, as I'm working with memory-mapped files, the records would have to move around anyway.
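To put rough numbers on that (a sketch only; the 100GB file size is the figure mentioned later in this reply, the rest is just arithmetic):

```perl
# Back-of-the-envelope: a 100GB file of 8-byte records holds more records
# than a 32-bit index can address, so each per-record pointer must be 8 bytes,
# doubling the memory footprint before a single record has been sorted.
my $file_bytes = 100 * 2**30;              # 100GB file (assumed size)
my $rec_bytes  = 8;                        # 8-byte records
my $records    = $file_bytes / $rec_bytes;
printf "records: %.2fe9 (2**32 = %.2fe9)\n", $records / 1e9, 2**32 / 1e9;
printf "pointer overhead: %d GB\n", $records * 8 / 2**30;
```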
I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key.
As I alluded to in my reply to anonymonk's query about bucket sort: the range of the values is so huge that even 100GB files cover only a tiny percentage of the range; but with an unknown distribution...
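For a sense of scale (a sketch assuming 64-bit keys and the same 100GB figure):

```perl
# Even the largest of these files covers only a vanishing fraction of a
# 64-bit key space, so fixed, evenly spaced bucket ranges say nothing
# useful about where the keys actually fall.
my $keys      = 100 * 2**30 / 8;     # at most ~1.3e10 distinct 8-byte keys
my $key_space = 2**64;               # ~1.8e19 possible values
printf "coverage: %.1e of the key range\n", $keys / $key_space;   # ~7.3e-10
```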
If you used that knowledge while creating the file ... I don't know how involved you are in creating the original file.
This isn't just one file, or one type of file, but any of a number of different types of file. My sort utility aims to be generic for any kind of fixed-record-length file. There is thus no way to quantify the distribution without a full pass over the file being sorted; and it varies greatly between file types.
For example, in one type the numbers represent huge file offsets (indexing much, much larger files), so the numbers will be unique and concentrated at the low end of the range -- lots of leading zero bytes.
In another type, the 64-bit numbers are actually two interleaved 32-bit numbers (i.e. Morton numbers) that give 2D coordinates a 1D spatial ordering; in that case the keys can be clustered in small, tight clumps but distributed over the full range. They can also be short strings with associated offsets into huge, variable-length record files. Or numbers and offsets. Etc.
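For illustration, a minimal sketch (on a 64-bit perl) of the bit interleaving that turns two 32-bit coordinates into one such 64-bit Morton number; the sub name is just for the example:

```perl
# Interleave the bits of two 32-bit coordinates: bit i of $x lands in bit 2*i
# of the result, bit i of $y in bit 2*i+1, giving the Z-order (Morton) key.
sub morton_encode {
    my ( $x, $y ) = @_;
    my $z = 0;
    for my $i ( 0 .. 31 ) {
        $z |= ( ( $x >> $i ) & 1 ) << ( 2 * $i );
        $z |= ( ( $y >> $i ) & 1 ) << ( 2 * $i + 1 );
    }
    return $z;
}

printf "%016x\n", morton_encode( 0xFFFFFFFF, 0 );   # 5555555555555555
```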
And that kind of indicates where I'm at with regard to radix sorts. Sorting into 256 buckets byte by byte, for many of the file types the first three bytes would be useless, putting all the records in the first bucket. For others, the first byte will be restricted to just 26 values, which for a 100GB file is perfect -- ~4GB in each populated bucket; but these things only ever get larger, and 250+GB is already a reality. Then you are talking ~10GB per bucket, which gets impractical. For my machine at least.
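To make that concrete, a minimal in-memory sketch of that first MSD pass, assuming 8-byte records with big-endian keys read from an already-open filehandle $fh (for files of the sizes discussed the buckets would of course have to be spooled to disk rather than held in arrays):

```perl
# Distribute fixed-length records into 256 buckets on the most significant
# byte. If the keys all have leading zero bytes, everything lands in
# $buckets[0] and the pass achieves nothing -- the problem described above.
my @buckets = map { [] } 0 .. 255;
while ( read( $fh, my $rec, 8 ) == 8 ) {
    my $msb = ord substr( $rec, 0, 1 );    # first byte of a big-endian key
    push @{ $buckets[$msb] }, $rec;
}
```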
Thinking about it, I guess you could start the bucket/radix sort with the least significant byte rather than the most; but then you need to merge again from all the buckets simultaneously.
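A minimal in-memory sketch of that least-significant-byte-first variant, again assuming 8-byte big-endian keys (in practice each pass would mean rewriting the whole file):

```perl
# LSD radix sort: one stable pass per byte, starting at the least significant
# byte (position 7 of a big-endian key). Because each pass is stable, the
# buckets are concatenated in order before the next pass begins.
sub lsd_radix_sort {
    my @recs = @_;
    for my $pos ( reverse 0 .. 7 ) {                 # 7, 6, ... 0
        my @buckets = map { [] } 0 .. 255;
        push @{ $buckets[ ord substr( $_, $pos, 1 ) ] }, $_ for @recs;
        @recs = map { @$_ } @buckets;
    }
    return @recs;
}
```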
The only workable alternative to the "what range" problem I can think of requires a full pre-scan of the entire file. Which sounds awful, but may actually be more practical than I previously thought.
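A sketch of what that pre-scan might look like: one sequential pass histogramming the leading byte of each key, from which balanced bucket boundaries could then be derived. The sub name and the record-length parameter are just for the example:

```perl
# One sequential pass over the file, counting how many keys start with each
# possible leading byte. Adjacent byte values can then be grouped until each
# group holds roughly total/N records, giving buckets of near-equal size
# regardless of how skewed the key distribution is.
sub prescan_leading_byte {
    my ( $fh, $reclen ) = @_;
    my @hist = (0) x 256;
    while ( read( $fh, my $rec, $reclen ) == $reclen ) {
        ++$hist[ ord substr( $rec, 0, 1 ) ];
    }
    return \@hist;
}
```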