in reply to Re: In-place sort with order assignment
in thread In-place sort with order assignment

And I'd look into finding the sort order at the moment you're creating the huge hash.

The big hash is effectively uniq'ing millions from hundreds or thousands of millions of inputs. Those inputs are "words" extracted from "lines" read from an input file. The sort is to order those "words". There's no way to determine the ordering without sorting.

But I would go a different way: write all the keys to a file, one key per line. Call sort(1). Read the sorted file. Assign $. to each value.

Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?

If so, you might be onto something.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^3: In-place sort with order assignment
by JavaFan (Canon) on Sep 20, 2010 at 14:06 UTC
    Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?
    On my computer? No, I wouldn't even be able to read in 10 million keys without blowing memory.

    On your computer? I've no idea - I don't have access. You can try it out (after writing the code, you'll have your answer within 2 minutes). Or you can just dismiss it out of hand, without knowing for sure. Your call - it's your problem after all.

      On your computer?

      No!

      #! perl -slw
      use strict;

      our $N //= 10e6;

      # Build a hash with $N dummy keys to stand in for the real one.
      my %hash;
      my $val = 'AAAAA';
      undef $hash{ $val++ } for 1 .. $N;

      my $start = time;

      # Stream the keys out through sort(1) into a file ...
      open SORT, '| sort > sorted' or die $!;
      print SORT $_ while $_ = each %hash;
      close SORT;

      undef %hash;

      # ... then read them back and let $. supply the ordering.
      open IN, '<', 'sorted' or die $!;
      chomp(), $hash{ $_ } = $. while <IN>;
      close IN;

      my $elapsed = time - $start;
      printf "Took %.6f seconds for $N items (%.5f per second)\n", $elapsed, $elapsed / $N;

      __END__
      C:\test>junk41 -N=10e6
      Took 225.000000 seconds for 10e6 items (0.00002 per second)

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Not that this seems to help you very much, but sort was my first instinct too. My rather dated Debian Squeeze box, an AMD 3800+ (64-bit dual core) with 2 GB of RAM and one 5-year-old IDE drive, can almost do it.

        $ perl 860849.pl
        Took 131.000000 seconds for 10000000 items (0.00001 per second)

        I think I could reach that 108 seconds by putting the sorted file on a separate spindle with a thinner filesystem layer (e.g. skipping ext3 and LVM2).

        I'm really pointing this out because it's quite attainable with a reasonably recent hard drive. Maybe even just SATA/PATA would do the trick. SSD surely would.

Re^3: In-place sort with order assignment (-u)
by tye (Sage) on Sep 21, 2010 at 05:28 UTC

    My first attempt would be to skip the hash altogether and just run the 'keys' straight through "/usr/bin/sort -u" as they are extracted. But perhaps the hash is needed for other than eliminating duplicates. (Yes, I have "sort -u" on Windows -- it is just one of many useful things that came with cygwin.)
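
    In Perl terms, a minimal sketch of that might look like the following, with 'lines.txt' and the whitespace split standing in for whatever the real word extraction does, and assuming a sort that understands -u:

    #! perl -lw
    use strict;

    # Minimal sketch: stream every extracted word straight through "sort -u",
    # then read the unique, sorted words back and let $. provide the rank.
    # 'lines.txt' and the whitespace split are stand-ins for the real extraction.
    open my $words, '<', 'lines.txt'      or die $!;
    open my $sorted, '| sort -u > sorted' or die $!;
    while ( <$words> ) {
        print {$sorted} $_ for split;    # duplicates and all; sort -u uniques them
    }
    close $sorted;

    my %order;
    open my $in, '<', 'sorted' or die $!;
    chomp(), $order{ $_ } = $. while <$in>;
    close $in;

    That keeps only one input line in memory at a time and leaves all of the uniquing and ordering to sort.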

    - tye        

      Interesting thought. I have a long process running currently that I'd like to let finish without risking swapping/time penalties, but I'll give it a go later.

      I do wonder if it isn't just moving the goal posts though. Instead of sorting 10e6 unique words, it would mean sorting 300e6 to 1e9 words and then discarding the vast majority of them.

      It would address the memory problem, but at N log N, with N growing by a factor of 30 to 100, the time cost might be prohibitive?
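
      As a rough back-of-envelope (taking 1e9 total words and 10e6 unique as purely illustrative figures from the ranges above), the comparison count alone differs by roughly two orders of magnitude:

      #! perl -lw
      use strict;

      # Back-of-envelope only: comparison counts for "sort everything, then
      # discard dupes" vs. "uniq first, then sort". D and U are the purely
      # illustrative 1e9 total / 10e6 unique figures; constants and the O(D)
      # cost of merely reading the duplicates are ignored.
      my ( $D, $U ) = ( 1e9, 10e6 );
      my $l2 = log( 2 );

      printf "sort everything: ~%.2g comparisons\n", $D * log( $D ) / $l2;
      printf "uniq then sort:  ~%.2g comparisons\n", $U * log( $U ) / $l2;
      printf "ratio:           ~%.0fx\n", ( $D * log( $D ) ) / ( $U * log( $U ) );
      # roughly: 3e+10 vs 2.3e+08 -- a factor of ~130 in comparisons alone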


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Is "/usr/bin/sort -u" really so dumb as to sort all of the duplicates and then only eliminate them at the end? I would be a bit surprised as well as sad if that were so. I would also be surprised if it used a hash or similar structure that tends to thwart "divide and conquer" strategies that end up being required for huge sorts like /usr/bin/sort is designed for.

        I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch". But I've also long wondered why I have yet to run into any discussions of a formal algorithm like that (not having searched for such). My somewhat wild guess is that the performance of such an algorithm would be closer to O(U*log(U)) than to O(D*log(D)) (U= number of unique items, D= total number of items including duplicates). A lower bound would be O(D+U*log(U)) and I suspect that is a pretty close bound.
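
        For what it's worth, here is a guess at the shape of that merge step (in-memory and simplified; certainly not what /usr/bin/sort actually does): whenever the heads of two already-deduplicated runs compare equal, only one copy goes through.

        # Sketch only, not sort(1)'s real code: merge two sorted, duplicate-free
        # runs, emitting a single copy whenever equal heads "touch".
        sub merge_unique {
            my ( $x, $y ) = @_;          # array refs to the two runs
            my @out;
            while ( @$x && @$y ) {
                my $cmp = $x->[0] cmp $y->[0];
                if    ( $cmp < 0 ) { push @out, shift @$x }
                elsif ( $cmp > 0 ) { push @out, shift @$y }
                else               { push @out, shift @$x; shift @$y }  # dup dropped
            }
            return [ @out, @$x, @$y ];   # whichever run remains is already clean
        }

        The initial run formation would need the same duplicate-dropping, but from then on each duplicate dies at the first merge where it meets its twin, which is where an O(D + U*log(U))-ish bound would have to come from.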

        As an aside, even with no customization of the sorting algorithm, a huge number of duplicates would speed up worst-case performance. Consider a huge list of just one item repeated over and over. Sorting it is O(N): far fewer comparisons are needed, since two things being equal means they aren't "out of order", so no further checking is required to pin down where each belongs in the output.

        That makes a somewhat interesting case that can be estimated via observation. Generate huge lists of duplicates of equal size but with much different quantities of duplication and compare the run-times of '/usr/bin/sort' and '/usr/bin/sort -u' against them. Repeat for a wide range of list sizes. Perhaps also throw in Windows' sort.exe which doesn't have -u and might lack some optimizations related to duplicates? (No, I don't plan to spend any time running such an experiment any time soon.)
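
        Should anyone else feel like burning the cycles, the experiment is only a few lines; the list size and duplication levels below are arbitrary, and a Unix-ish shell with /dev/null (cygwin included) is assumed:

        #! perl -lw
        use strict;
        use Time::HiRes qw( time );

        # Sketch of the experiment: same number of lines each time, varying how
        # many distinct values they are drawn from, timing sort vs. sort -u.
        # The sizes are arbitrary; scale to taste (and to your disk's patience).
        my $lines = 1_000_000;
        for my $distinct ( 10, 1_000, 100_000, $lines ) {
            open my $out, '>', 'dups.txt' or die $!;
            print {$out} 'K' . int rand $distinct for 1 .. $lines;
            close $out;

            for my $cmd ( 'sort dups.txt > /dev/null', 'sort -u dups.txt > /dev/null' ) {
                my $start = time;
                system $cmd;
                printf "%-32s (%7d distinct): %.2f s\n", $cmd, $distinct, time - $start;
            }
        }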

        - tye