in reply to In-place sort with order assignment

There's a simple, quadratic, solution: repeatedly find the smallest key that hasn't gotten a value yet. Give it the next value.

But I would go a different way: write all the keys to a file, one key per line. Call sort(1). Read the sorted file. Assign $. to each value.

And I'd look into finding the sort order at the moment you're creating the huge hash.

  • Comment on Re: In-place sort with order assignment

Replies are listed 'Best First'.
Re^2: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 20, 2010 at 13:52 UTC
    And I'd look into finding the sort order at the moment you're creating the huge hash.

    The big hash is effectively uniq'ing millions from hundreds or thousands of millions of inputs. Those inputs are "words" extracted from "lines" read from an input file. The sort is to order those "words". There's no way to determine the ordering without sorting.

    But I would go a different way: write all the keys to a file, one key per line. Call sort(1). Read the sorted file. Assign $. to each value.

    Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?

    If so, you might be onto something.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?
      On my computer? No, I wouldn't even be able to read in 10 million keys without blowing memory.

      On your computer? I've no idea - I don't have access. You can try it out (after writing the code, you'll have your answer within 2 minutes). Or you can just dismiss out of hand, without knowing for sure. Your call - it's your problem after all.

        On your computer?

        No!

        #! perl -slw use strict; our $N //= 10e6; my %hash; my $val = 'AAAAA'; undef $hash{ $val++ } for 1 .. $N; my $start = time; open SORT, '| sort > sorted' or die $!; print SORT $_ while $_ = each %hash; close SORT; undef %hash; open IN, '<', 'sorted' or die $!; chomp(), $hash{ $_ } = $. while <IN>; close IN; my $elapsed = time - $start; printf "Took %.6f seconds for $N items (%.5f per second)\n", $elapsed, $elapsed / $N; __END__ C:\test>junk41 -N=10e6 Took 225.000000 seconds for 10e6 items (0.00002 per second)

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      My first attempt would be to skip the hash altogether and just run the 'keys' straight through "/usr/bin/sort -u" as they are extracted. But perhaps the hash is needed for other than eliminating duplicates. (Yes, I have "sort -u" on Windows -- it is just one of many useful things that came with cygwin.)

      - tye        

        Interesting thought. I have a long process running currently that I'd like to let finish without risking swapping/ time penalties, but I'll give it a go later.

        I do wonder if it isn't just moving the goal posts though. Instead of sorting 10e6 unique words, it would mean sorting 300e6 to 1e9 words and then discarding the vast majority of them.

        It would address the memory problem, but at N log N, with N*30 to n*100, the time cost might be prohibitive?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.