in reply to Re^3: [OT] A measure of 'sortedness'?
in thread [OT] A measure of 'sortedness'?

To make it as easy as possible for myself (as well as fast as I could), I added a pointer field to each record and made a linked list of the records as I read them in. Then I used the radix sort algorithm to rearrange the pointers, without moving the record bodies.
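
(In outline, something like the sketch below -- illustrative C only, assuming fixed 16-byte records whose first field is a 64-bit key; none of the names come from any real code:)

    /* Sketch of the pointer-list idea: each record gets an extra next
       pointer and is threaded onto a singly linked list as it is read;
       a later radix pass relinks the pointers, never the record bodies. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct rec {
        uint64_t    key;        /* 64-bit sort key                  */
        uint64_t    payload;    /* remainder of the 16-byte record  */
        struct rec *next;       /* extra pointer field for the list */
    } rec;

    static rec *read_as_list(FILE *fh, size_t n)
    {
        rec *buf = malloc(n * sizeof *buf);
        rec *head = NULL, *tail = NULL;
        if (!buf) return NULL;
        for (size_t i = 0; i < n; ++i) {
            if (fread(&buf[i], 16, 1, fh) != 1) break;   /* key + payload */
            buf[i].next = NULL;
            if (tail) tail->next = &buf[i]; else head = &buf[i];
            tail = &buf[i];
        }
        return head;
    }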

With typically 8-byte or 16-byte records, and enough of them to require > 32-bit indexes, I'd need 64-bit pointers to build even a single linked list, which would greatly exacerbate the memory problem. And in the end, as I'm working with memory-mapped files, the records would have to move around anyway.

I had another thought as well--If your keys are randomly distributed throughout the key space, and you know the number of records beforehand, you could make a pretty reasonable guess of the approximate final location of each record based on its key.
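
(In other words, something like the guess below -- a sketch only, assuming the keys really are spread uniformly over a known [lo, hi] range and that the record count is known up front:)

    /* Sketch: estimate a record's final slot from its key, assuming the
       keys are uniform over [lo, hi] (hi > lo) and nrecs > 0 is known. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t guess_slot(uint64_t key, uint64_t lo, uint64_t hi, size_t nrecs)
    {
        long double frac = (long double)(key - lo) / (long double)(hi - lo);
        return (size_t)(frac * (nrecs - 1));   /* approximate final position */
    }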

As I alluded to in my reply to anonymonk's query about bucket sort: the range of the values is so huge that even 100GB files cover only a tiny percentage of it; but with an unknown distribution...

If you used that knowledge while creating the file ... I don't know how involved you are in creating the original file.

This isn't just one file, or one type of file, but any of a number of different types of file. My sort utility is intended to be generic for any kind of fixed-record-length file. There is thus no way to quantify the distribution without a full pass over the file being sorted; and it varies greatly between filetypes.

For example, in one type the numbers represent huge file offsets (indexing much, much larger files), so the numbers will be unique and concentrated at the low end of the range -- lots of leading zero bytes.

In another type, the 64-bit numbers are actually two interleaved 32-bit numbers (i.e. Morton numbers) that give 2D coordinates a 1D spatial ordering, in which case the keys can be clustered in small, tight clumps but distributed over the full range. The keys can also be short strings with associated offsets into huge, variable-length-record files. Or numbers and offsets. Etc.
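
(For reference, a Morton key of that kind is built by interleaving the bits of the two 32-bit coordinates. A sketch of the standard bit-spreading trick -- illustrative only, not code from any of the file producers:)

    /* Sketch: build a 64-bit Morton number by interleaving the bits of
       two 32-bit coordinates, so nearby (x,y) points get nearby keys. */
    #include <stdint.h>

    static uint64_t spread_bits(uint32_t v)
    {
        uint64_t x = v;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
        x = (x | (x <<  8)) & 0x00FF00FF00FF00FFull;
        x = (x | (x <<  4)) & 0x0F0F0F0F0F0F0F0Full;
        x = (x | (x <<  2)) & 0x3333333333333333ull;
        x = (x | (x <<  1)) & 0x5555555555555555ull;
        return x;
    }

    static uint64_t morton64(uint32_t x, uint32_t y)
    {
        return spread_bits(x) | (spread_bits(y) << 1);
    }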

And that kind of indicates where I'm at with regard to radix sorts. Sorting into 256 buckets byte by byte, for many of the file types the first three bytes would be useless, putting all the records in the first bucket. For others, the first byte will be restricted to just 26 values, which for a 100GB file is perfect -- ~4GB in each populated bucket; but these things only ever get larger, and 250+GB is already a reality. Then you are talking ~10GB per bucket, which gets impractical -- for my machine at least.

Thinking about it, I guess you could start the bucket/radix sort using the least significant byte rather than the most; but then you need to merge again from all the buckets simultaneously.
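
(One distribution pass of that scheme would look something like the sketch below -- illustrative C over bare 64-bit keys; the real records carry payloads as well:)

    /* Sketch: one pass of a byte-wise radix sort, scattering 64-bit keys
       into dst ordered by byte 'shift' (0 = least significant).  The
       scatter preserves input order within a bucket, so the pass is stable. */
    #include <stdint.h>
    #include <stddef.h>

    static void radix_pass(const uint64_t *src, uint64_t *dst, size_t n, unsigned shift)
    {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; ++i)                 /* histogram the byte  */
            ++count[(src[i] >> (8 * shift)) & 0xFF];

        size_t pos[256], total = 0;                    /* prefix sums give    */
        for (int b = 0; b < 256; ++b) {                /* each bucket's base  */
            pos[b] = total;
            total += count[b];
        }

        for (size_t i = 0; i < n; ++i)                 /* stable scatter      */
            dst[pos[(src[i] >> (8 * shift)) & 0xFF]++] = src[i];
    }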

The only workable alternative to the "what range" problem I can think of requires a full pre-scan of the entire file. Which sounds awful, but may actually be more practical than I previously thought.
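
(The pre-scan itself is cheap to describe: one sequential pass that histograms, say, the top 16 bits of every key so the bucket boundaries can be chosen to fit the actual distribution. A sketch, again assuming 16-byte records with a native-endian 64-bit key first:)

    /* Sketch of a distribution pre-scan: histogram the top two bytes of
       each 64-bit key in a single sequential pass over the file. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static void prescan(FILE *fh, uint64_t hist[65536])
    {
        uint8_t  rec[16];                    /* fixed 16-byte record      */
        uint64_t key;
        memset(hist, 0, 65536 * sizeof hist[0]);
        while (fread(rec, sizeof rec, 1, fh) == 1) {
            memcpy(&key, rec, sizeof key);   /* key assumed to come first */
            ++hist[key >> 48];               /* top 16 bits pick the bin  */
        }
    }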


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re^5: [OT] A measure of 'sortedness'?
by roboticus (Chancellor) on Mar 20, 2015 at 12:11 UTC

    BrowserUk:

    Yeah, if the key distribution isn't uniform, that second idea flies right out the window. Since your records are so small, the overhead of pointers would suck. On the other hand, since the records *are* so small, you'd just swap the records around anyway, as you wouldn't gain anything to speak of by simply shuffling the pointers.

    With respect to the radix sort, though: Since it's a stable sort, it has a nice feature: instead of having to finish up with a merge, you just concatenate the lists end-to-end. That won't help you in this case, as you don't want to make many passes over your file to sort it, but it's something you can take advantage of in some situations.
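
    (Concretely, the list version reduces to eight distribute-then-concatenate passes, something like the sketch below -- illustrative only, over bare keys rather than your real records:)

        /* Sketch: LSD radix sort of a linked list.  Each pass scatters the
           nodes into 256 bucket lists by one key byte, then simply chains
           the buckets end-to-end -- no merge.  Because each pass keeps
           equal bytes in arrival order (it is stable), repeating it for
           bytes 0..7 leaves the whole list sorted. */
        #include <stdint.h>

        typedef struct node { uint64_t key; struct node *next; } node;

        static node *lsd_list_sort(node *head)
        {
            for (unsigned byte = 0; byte < 8; ++byte) {
                node *bh[256] = {0}, *bt[256] = {0};      /* bucket heads/tails */

                for (node *p = head, *nx; p; p = nx) {    /* distribute */
                    nx = p->next;
                    unsigned b = (unsigned)(p->key >> (8 * byte)) & 0xFF;
                    p->next = NULL;
                    if (bt[b]) bt[b]->next = p; else bh[b] = p;
                    bt[b] = p;
                }

                node *tail = NULL;                        /* concatenate, no merge */
                head = NULL;
                for (int b = 0; b < 256; ++b) {
                    if (!bh[b]) continue;
                    if (tail) tail->next = bh[b]; else head = bh[b];
                    tail = bt[b];
                }
            }
            return head;
        }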

    It sounds like you may have to go the route taken by the team that did that funky tree you talked about a few weeks ago (don't recall the name of it ... Judy tree, perhaps?), in that you may need multiple algorithms and adaptively choose between them based on the character of the keys/record lengths/other attributes of your file. I have no idea how you'd go about implementing such a thing, though.

    Another thought--It's pretty whacky, though, so stop here if you have a weak stomach.

    If you know the page size of your filesystem, and you could arrange your block rearrangement to target those block sizes, you could swap 'em by editing the filesystem metadata. (I did a FAT-12/FAT-16/FAT-32 filesystem at one time. While I didn't do any edits like that, it wouldn't be difficult in that case.) I don't know a lot about the internals of modern filesystems, but you might be able to find a filesystem you could run under Windows that you could modify to give you that ability.

    I *did* warn you that it was a whacky idea...

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.