Re^3: In-place sort with order assignment
by JavaFan (Canon) on Sep 20, 2010 at 14:06 UTC
Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out) that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?
On my computer? No, I wouldn't even be able to read in 10 million keys without blowing memory.
On your computer? I've no idea - I don't have access. You can try it out (after writing the code, you'll have your answer within 2 minutes). Or you can just dismiss it out of hand, without knowing for sure. Your call - it's your problem, after all.
#! perl -slw
use strict;

# -s lets the item count be overridden from the command line, e.g. -N=10e6
our $N //= 10e6;

# Build a hash of $N unique keys ('AAAAA', 'AAAAB', ...) with no values.
my %hash;
my $val = 'AAAAA';
undef $hash{ $val++ } for 1 .. $N;

my $start = time;

# Stream the keys out to an external sort, capturing the result in 'sorted'.
open SORT, '| sort > sorted' or die $!;
print SORT $_ while defined( $_ = each %hash );
close SORT;

undef %hash;    # free the original hash before reloading

# Read the sorted keys back, assigning each its line number as its rank.
open IN, '<', 'sorted' or die $!;
chomp(), $hash{ $_ } = $. while <IN>;
close IN;

my $elapsed = time - $start;
printf "Took %.6f seconds for $N items (%.5f seconds per item)\n",
    $elapsed, $elapsed / $N;
__END__
C:\test>junk41 -N=10e6
Took 225.000000 seconds for 10e6 items (0.00002 seconds per item)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
$ perl 860849.pl
Took 131.000000 seconds for 10000000 items (0.00001 seconds per item)
I think I could reach that 108 seconds by putting the sorted file on a separate spindle with a thinner filesystem layer (e.g. no ext3 or LVM2).
I'm really pointing this out because it's quite attainable with a reasonably recent hard drive. Even plain SATA/PATA might do the trick; an SSD surely would.
Re^3: In-place sort with order assignment (-u)
by tye (Sage) on Sep 21, 2010 at 05:28 UTC
My first attempt would be to skip the hash altogether and just run the 'keys' straight through "/usr/bin/sort -u" as they are extracted. But perhaps the hash is needed for something other than eliminating duplicates. (Yes, I have "sort -u" on Windows -- it is just one of many useful things that came with Cygwin.)
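A minimal sketch of that approach (my illustration, not tye's code; the sample keys and the 'sorted.txt' filename are stand-ins, and in the real problem the keys would be streamed from wherever they are extracted rather than held in an array):
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in data; duplicates included on purpose.
my @raw_keys = qw( banana apple banana cherry apple banana );

# Pipe the keys through an external "sort -u"; duplicates never come back.
open my $sort, '|-', 'sort -u > sorted.txt' or die "cannot start sort: $!";
print {$sort} "$_\n" for @raw_keys;
close $sort or die "sort pipe failed (status $?)";

# Read the unique, sorted keys back and record each one's rank.
my %rank;
open my $in, '<', 'sorted.txt' or die "cannot read sorted.txt: $!";
while ( my $key = <$in> ) {
    chomp $key;
    $rank{$key} = $.;
}
close $in;

print "$_ => $rank{$_}\n" for sort keys %rank;    # apple => 1, banana => 2, cherry => 3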
Interesting thought. I have a long process running at the moment that I'd like to let finish without risking swapping or time penalties, but I'll give it a go later.
I do wonder if it isn't just moving the goalposts, though. Instead of sorting 10e6 unique words, it would mean sorting 300e6 to 1e9 words and then discarding the vast majority of them.
It would address the memory problem, but at N log N cost, with N inflated by a factor of 30 to 100, mightn't the time cost be prohibitive?
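For a rough sense of that scale (a back-of-the-envelope sketch only, ignoring constant factors, disk I/O, and any duplicate-aware shortcuts the external sort might take):
use strict;
use warnings;

# Compare comparison counts for a plain O(N log N) sort of the 10e6
# unique keys versus 30x and 100x that many keys, duplicates included.
my $unique = 10e6;
for my $factor ( 30, 100 ) {
    my $total = $factor * $unique;
    my $ratio = ( $total * log $total ) / ( $unique * log $unique );
    printf "%3dx the input => roughly %.0fx the comparisons\n", $factor, $ratio;
}
__END__
 30x the input => roughly 36x the comparisons
100x the input => roughly 129x the comparisons
So even setting memory aside, the comparison count grows somewhat faster than the 30x to 100x blow-up in the input itself.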
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Is "/usr/bin/sort -u" really so dumb as to sort all of the duplicates and then only eliminate them at the end? I would be a bit surprised as well as sad if that were so. I would also be surprised if it used a hash or similar structure that tends to thwart "divide and conquer" strategies that end up being required for huge sorts like /usr/bin/sort is designed for.
I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch". But I've also long wondered why I have yet to run into any discussions of a formal algorithm like that (not having searched for such). My somewhat wild guess is that the performance of such an algorithm would be closer to O(U*log(U)) than to O(D*log(D)) (U= number of unique items, D= total number of items including duplicates). A lower bound would be O(D+U*log(U)) and I suspect that is a pretty close bound.
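As a toy illustration of that idea (emphatically not how /usr/bin/sort is implemented; just a sketch of a merge step that drops a duplicate the moment two equal items meet):
use strict;
use warnings;

# Merge two sorted, duplicate-free lists, keeping a single copy whenever
# the two heads compare equal -- duplicates vanish as soon as they "touch".
sub merge_unique {
    my ( $left, $right ) = @_;
    my @out;
    while ( @$left && @$right ) {
        my $cmp = $left->[0] cmp $right->[0];
        if    ( $cmp < 0 ) { push @out, shift @$left }
        elsif ( $cmp > 0 ) { push @out, shift @$right }
        else               { push @out, shift @$left; shift @$right }
    }
    return [ @out, @$left, @$right ];
}

# Plain top-down merge sort built on the duplicate-dropping merge above.
sub sort_unique {
    my @items = @_;
    return [ @items ] if @items <= 1;
    my $mid = int( @items / 2 );
    return merge_unique(
        sort_unique( @items[ 0 .. $mid - 1 ] ),
        sort_unique( @items[ $mid .. $#items ] ),
    );
}

print "@{ sort_unique( qw( peach apple peach apple peach ) ) }\n";    # apple peach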
As an aside, even with no customization of the sorting algorithm, a huge number of duplicates would speed up worst-case performance. Consider a huge list of just one item repeated over and over. Sorting it is O(N): two items comparing equal means they aren't "out of order", so far fewer comparisons are needed and no further checking is required to pin down where each belongs in the output.
That makes a somewhat interesting case that can be probed by observation. Generate huge lists of equal total size but with very different amounts of duplication and compare the run-times of '/usr/bin/sort' and '/usr/bin/sort -u' against them. Repeat for a wide range of list sizes. Perhaps also throw in Windows' sort.exe, which doesn't have -u and might lack some duplicate-related optimizations? (No, I don't plan to spend any time running such an experiment any time soon.)
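A sketch of how that experiment might be wired up (assuming a Unix-ish box with sort on the PATH; the file name, sizes, and key format are arbitrary, and Windows' sort.exe would need its own invocation):
use strict;
use warnings;
use Time::HiRes qw( time );

# Same total line count every run, but at most $distinct different values,
# so only the amount of duplication varies between runs.
my $total = 1_000_000;
for my $distinct ( 10, 1_000, 100_000, $total ) {
    open my $fh, '>', 'dups.txt' or die $!;
    print {$fh} sprintf( "key%09d\n", int rand $distinct ) for 1 .. $total;
    close $fh;

    for my $cmd ( 'sort dups.txt > /dev/null', 'sort -u dups.txt > /dev/null' ) {
        my $start = time;
        system( $cmd ) == 0 or die "'$cmd' failed";
        printf "distinct<=%-8d %-34s %.2fs\n", $distinct, $cmd, time - $start;
    }
}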