in reply to Re^4: In-place sort with order assignment
in thread In-place sort with order assignment

BrowserUk,
Hrm. It sounds really close, if not identical, to the one described by Corion. If I didn't have kids running around screaming "Daddy! Daddy!", I would code up a solution, but a better description will have to do.

I am confident that, in addition to my hash, I can afford two additional arrays of 1000 items each. On the first pass through the hash using each, you check whether the current key sorts before (lt) the first element of the array. If yes, you unshift; if not, you check the next item, and the next, until you find the proper location and use splice. If you reach the end of the array and it holds fewer than 1000 items, you push. After inserting, you check whether you have 1001 items and, if so, you pop. At the end of the first pass, you know the first 1000 items. You then begin your second pass. This time, when you encounter an item that is in your first array, you assign it its appropriate value while simultaneously populating the 2nd array with the next candidates (anything greater than the last element of the first). The bounded insertion step is sketched just below.
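
A minimal sketch of just that bounded insertion step (the names $max, @bot_n and insert_bounded are illustrative only, not part of any module):

use strict;
use warnings;

my $max   = 1000;   # keep no more than the bottom $max keys
my @bot_n = ();     # maintained in sorted order

# Offer one key; worst case O($max) because of the linear scan and splice.
sub insert_bounded {
    my $key = shift;
    for my $i (0 .. $#bot_n) {
        if ($key lt $bot_n[$i]) {
            splice @bot_n, $i, 0, $key;   # insert at its sorted position
            pop @bot_n if @bot_n > $max;  # 1001 items? drop the largest
            return;
        }
    }
    push @bot_n, $key if @bot_n < $max;   # larger than everything kept
}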

Now of course this is silly - there are much better data structures than an array - but I hope you get the idea.

Cheers - L~R


Re^6: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 19, 2010 at 15:37 UTC

    I guess I'm being thick here, but at the end of your second pass you have 2000 items in two arrays, plus the original number in the hash. Then what?

    And does each "pass" mean a pass over the 1000 (or 2000) elements of the array(s)? Or a full pass over the (millions of) keys of the hash?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      Ok, I must be the one being thick here, so I will give you a real solution as soon as I finish lunch.

      Update:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Data::Dumper;

      # Toy data: 26 keys, all ranks unknown (undef).
      my %hash = map { $_ => undef } 'a' .. 'z';

      my $at_once = 3;   # maximum number of items to sort at once (M)
      my $cnt     = 0;   # next rank to assign
      my @bot_n;         # sorted array of this pass's bottom M unranked keys
      my %known;         # ranks found on the previous pass, written this pass

      while (1) {
          while (my ($key, $val) = each %hash) {
              next if defined $val;                 # already ranked

              # Write a rank discovered on the previous pass.
              if (exists $known{$key}) {
                  $hash{$key} = $known{$key};
                  next;
              }

              # Insertion sort into the bounded @bot_n array.
              my $inserted;
              for (0 .. $#bot_n) {
                  if ($key lt $bot_n[$_]) {
                      splice @bot_n, $_, 0, $key;
                      $inserted = 1;
                      last;
                  }
              }
              push @bot_n, $key if ! $inserted;
              pop @bot_n if @bot_n > $at_once;      # keep only the bottom M
          }
          last if ! @bot_n;                         # nothing left unranked

          # Remember this pass's ranks; they get written on the next pass.
          %known = ();
          for (@bot_n) { $known{$_} = ++$cnt; }
          @bot_n = ();
      }

      print Dumper(\%hash);

      As you can see, this works, but it could be greatly improved with a data structure other than an array to maintain order. (Running it dumps the hash with every key 'a' .. 'z' assigned its alphabetical rank, 1 through 26.) I converted the 2nd array from my earlier description to the %known hash to avoid doing a binary search, since American football is about to come on and Sunday is family day :-)

      Update 2: This requires N / M passes through the hash, where N is the total number of keys and M is the maximum number of items sorted at once. Using an array and a hash to keep track of the current unknown bottom M is extremely inefficient from a runtime perspective, but data structures that support a faster runtime also consume more memory. Finding a balance of speed and memory is left as an exercise for the reader :-)
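
      For the toy example above, N = 26 and M = 3: the loop makes ceil(26/3) = 9 passes to identify each block of (at most) three keys, plus one final pass to write the last block's ranks and find nothing left unranked.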

      Cheers - L~R

        Many thanks for the code because I had drawn a complete blank on the description.

        The problem with it is that (as currently implemented) it is quadratic in time:

        C:\test>junk -N=100
        Took 0.001339 seconds for 100 items (0.000013 per item)

        C:\test>junk -N=1000
        Took 0.105176 seconds for 1000 items (0.000105 per item)

        C:\test>junk -N=10000
        Took 11.326000 seconds for 10000 items (0.001133 per item)

        C:\test>junk -N=20000
        Took 46.241000 seconds for 20000 items (0.002312 per item)

        C:\test>junk -N=30000
        Took 105.498000 seconds for 30000 items (0.003517 per item)

        C:\test>junk -N=40000
        Took 186.630000 seconds for 40000 items (0.004666 per item)

        That's using M as 10% of N.

        Which I project means over 24 hours for a million items, and over four days for 2 million.
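
        As a rough check, assuming pure quadratic scaling from the N=40000 measurement: 186.63s * (1e6 / 4e4)**2 = 186.63 * 625 ≈ 116,600s ≈ 32 hours for a million items, and four times that, ≈ 130 hours (about five and a half days), for 2 million - consistent with those lower bounds.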

        I appreciate that an insertion sort using splice can be improved upon with (say) a heap, but most of the modules implementing alternatives to Perl's built-in data structures tend to be objects wrapped over hashes or arrays, so what you gain from a somewhat more intelligent data structure, you lose to the method-call overheads :( One way around that is to keep the heap in a plain array, as sketched below.
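
        For illustration only (this is not from any module): a bottom-M selection heap kept in a plain Perl array, so there is no object or method-call overhead. The names $M, @heap and consider() are made up; each call costs O(log M) instead of the O(M) scan-and-splice above.

        use strict;
        use warnings;

        my $M = 1000;   # size of the bottom-M window (assumed)
        my @heap;       # max-heap of the M smallest keys seen so far

        # Offer one key to the heap.
        sub consider {
            my $key = shift;
            if (@heap < $M) {                   # not full yet: append and sift up
                push @heap, $key;
                my $i = $#heap;
                while ($i > 0) {
                    my $p = int(($i - 1) / 2);
                    last if $heap[$i] le $heap[$p];
                    @heap[$i, $p] = @heap[$p, $i];
                    $i = $p;
                }
            }
            elsif ($key lt $heap[0]) {          # beats the worst kept key: replace root, sift down
                $heap[0] = $key;
                my $i = 0;
                while (1) {
                    my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
                    my $big = $i;
                    $big = $l if $l <= $#heap && $heap[$l] gt $heap[$big];
                    $big = $r if $r <= $#heap && $heap[$r] gt $heap[$big];
                    last if $big == $i;
                    @heap[$i, $big] = @heap[$big, $i];
                    $i = $big;
                }
            }
        }

        # Usage: consider($_) for keys %hash; sorting @heap then yields the bottom M in order.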

