in reply to Re^6: In-place sort with order assignment
in thread In-place sort with order assignment

BrowserUk,
Ok, I must be the one being thick here so I will give you a real solution as soon as I finish lunch.

Update:

#!/usr/bin/perl use strict; use warnings; my %hash = map {$_ => undef} 'a' .. 'z'; my ($at_once, $cnt, @bot_n, %known) = (3, 0, (), ()); while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } my $inserted; for (0 .. $#bot_n) { if ($key lt $bot_n[$_]) { splice @bot_n, $_, 0, $key; $inserted = 1; last; } } push @bot_n, $key if ! $inserted; pop @bot_n if @bot_n > $at_once; } last if ! @bot_n; %known = (); for (@bot_n) { $known{$_} = ++$cnt; } @bot_n = (); } use Data::Dumper; print Dumper(\%hash);

As you can see, this works but could be greatly improved with a data structure other than an array to maintain order. I converted the 2nd array to a hash to avoid doing a binary search since American football is about to come on and Sunday is family day :-)

Update 2: This requires N / M passes through the hash where N represents the total number of keys and M represents the maximum number of items to sort at once. Using an array and a hash to keep track of the current unknown bottom N is extremely inefficient (from runtime perspective) but data structures that support a faster run time also consume more memory. Finding a balance of speed and memory is left as an exercise for the reader :-)

Cheers - L~R

Replies are listed 'Best First'.
Re^8: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 19, 2010 at 19:19 UTC

    Many thanks for the code because I had drawn a complete blank on the description.

    The problem with it is that (as currently implemented) it is quadratic in time:

    C:\test>junk -N=100 Took 0.001339 seconds for 100 items (0.000013 per item) C:\test>junk -N=1000 Took 0.105176 seconds for 1000 items (0.000105 per item) C:\test>junk -N=10000 Took 11.326000 seconds for 10000 items (0.001133 per item) C:\test>junk -N=20000 Took 46.241000 seconds for 20000 items (0.002312 per item) C:\test>junk -N=30000 Took 105.498000 seconds for 30000 items (0.003517 per item) C:\test>junk -N=40000 Took 186.630000 seconds for 40000 items (0.004666 per item)

    That's using M as 10% of N.

    Which I project means over 24 hours for a million items and 4 days for 2 million.

    I appreciate that doing an insertion sort using splice can be improved upon using (say) a heap, but most of the modules implementing alternatives to perl's built-in data structures, tend to be implemented using objects wrapped over hashes or arrays, and so what you gain from a somewhat more intelligent DS, you loose from the calling overheads :(


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I have no idea how much additional memory Heap::Simple::XS uses under the covers, but the speed is dramatically faster than using splice with a binary search (above).
      #!/usr/bin/perl use strict; use warnings; use Heap::Simple::XS; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my $heap = Heap::Simple::XS->new(order => "gt", elements => "Scalar", +max_count => $at_once); my ($cnt, $beg, %known) = ($at_once, [gettimeofday], ()); while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); } my $items = $heap->count; last if ! $items; %known = (); my $max = $cnt + $items; $known{$_} = $cnt-- for $heap->extract_all; $cnt = $max; $heap->clear; } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk2.pl 100 Took 0.001999 seconds for 100 items (0.0000200 per item) C:\tmp>perl buk2.pl 1000 Took 0.021015 seconds for 1000 items (0.0000210 per item) C:\tmp>perl buk2.pl 10000 Took 0.241327 seconds for 10000 items (0.0000241 per item) C:\tmp>perl buk2.pl 100000 Took 3.375 seconds for 100000 items (0.0000338 per item) C:\tmp>perl buk2.pl 1000000 Took 48.25 seconds for 1000000 items (0.0000483 per item)

      Cheers - L~R

        I have no idea how much additional memory Heap::Simple::XS uses under the covers,

        For 1e6 items, the memory usage grows from 145MB to over 200MB, which for 10e6 items is going to push a 32-bit machine into swapping.

        That said, I think this memory usage may, in part at least, be due to a bug in this incarnation of the code.

        I cannot see what would prevent this loop copying everything from %hash into both %known and the heap?

        while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); }

        Overall, the approach used in the second snippet in Re^2: In-place sort with order assignment seems to be the best. It takes 8 seconds and very little extra memory for 1e6; versus 50 seconds and +25% for the heap. And it happily handles 10e6 in 108 seconds and under 2GB.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      Perhaps the following minor modifications are more to your liking:
      #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my ($cnt, @bot_n, %known) = (0, (), ()); my $beg = [gettimeofday]; while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } my $inserted; my ($min, $max) = (0, $#bot_n); while (1) { my $mid = $min + (($max - $min) / 2); last if ! defined $bot_n[$mid]; $key gt $bot_n[$mid] ? ($min = $mid + 1) : ($max = $mid - +1); last if $min > $max; } splice @bot_n, $min, 0, $key; pop @bot_n if @bot_n > $at_once; } last if ! @bot_n; %known = (); $known{$_} = ++$cnt for @bot_n; @bot_n = (); } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk.pl 100 Took 0.003766 seconds for 100 items (0.0000377 per item) C:\tmp>perl buk.pl 1000 Took 0.054244 seconds for 1000 items (0.0000542 per item) C:\tmp>perl buk.pl 10000 Took 0.734375 seconds for 10000 items (0.0000734 per item) C:\tmp>perl buk.pl 100000 Took 12.515625 seconds for 100000 items (0.0001252 per item) C:\tmp>perl buk.pl 1000000 Took 394.375 seconds for 1000000 items (0.0003944 per item)

      Cheers - L~R

      BrowserUk,
      I have already shared this with you on my scratch but I would like to leave it here as well for posterity purposes. It turns out that even without the overhead of function calls, a pure perl heap still doesn't compete with the binary search and splice. As tye pointed out in the CB as I was lamenting, count opcodes. My guess is that perl has very optimized C under the covers using memcpy and what not making it superior to a home grown heap.

      I do want to point out that some of the complexity in the example below is because the heap has been special cased to this specific problem (limiting size). A general purpose implementation would be more powerful though less complex due to abstraction. This means it would also be slower.

      #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my %hash = construct_hash(); my ($cnt, $top_n, @heap, %known) = (0, int((keys %hash) * .10), (), () +); my $beg = [gettimeofday]; while (1) { my ($max_key, $max_pos) = ('', undef); while (my ($key, $val) = each %hash) { # Skip entries we already have assigned a value too next if defined $val; # If we know the value of this key, assign it and move one if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } # Record where the key is inserted my $new_pos; # Check to see if heap is full if (@heap == $top_n) { # Key is already bigger than the top-n most key next if $key gt $max_key; # Insert the value in the previous maximum position ($heap[$max_pos], $new_pos) = ($key, $max_pos); # Find new max key and pos $max_key = ''; my $beg = $max_pos % 2 ? (($max_pos - 1) / 2) : (($max_pos + - 2) / 2); for ($beg .. $#heap) { ($max_key, $max_pos) = ($heap[$_], $_) if $heap[$_] gt + $max_key; } } else { push @heap, $key; $new_pos = $#heap; ($max_key, $max_pos) = ($key, $#heap) if $key gt $max_key; } # Bubble up my $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_pos + - 2) / 2; while ($heap[$new_pos] lt $heap[$parent_pos] && $parent_pos > +-1) { $max_pos = $new_pos if $heap[$parent_pos] eq $max_key; ($heap[$new_pos], $heap[$parent_pos], $new_pos) = ($heap[$ +parent_pos], $heap[$new_pos], $parent_pos); $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_po +s - 2) / 2; } } last if ! @heap; %known = (); #@heap = sort @heap; #$known{$_} = ++$cnt for @heap; while (@heap) { my $key; if (@heap == 1 || @heap == 3) { $key = shift @heap; } elsif (@heap == 2) { $key = $heap[0] lt $heap[1] ? (shift @heap) : (pop @heap); } else { # Record the top item $key = $heap[0]; # Replace the root with the last item from the last level $heap[0] = pop @heap; # Bubble down my $cur_pos = 0; my $child_pos = $heap[1] lt $heap[2] ? 1 : 2; while ($heap[$cur_pos] gt $heap[$child_pos]) { ($heap[$cur_pos], $heap[$child_pos], $cur_pos) = ($hea +p[$child_pos], $heap[$cur_pos], $child_pos); my ($l_pos, $r_pos) = ((($cur_pos * 2) + 1), (($cur_po +s * 2) + 2)); last if $l_pos > $#heap; if ($r_pos > $#heap) { $child_pos = $l_pos; next; } $child_pos = $heap[$l_pos] lt $heap[$r_pos] ? $l_pos : + $r_pos; } } $known{$key} = ++$cnt; } } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $cnt); print "Took $elapsed seconds for $cnt items ($per per item)\n"; sub construct_hash { my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; return %hash; }

      Cheers - L~R