in reply to Re^7: In-place sort with order assignment
in thread In-place sort with order assignment

Many thanks for the code because I had drawn a complete blank on the description.

The problem with it is that (as currently implemented) it is quadratic in time:

C:\test>junk -N=100 Took 0.001339 seconds for 100 items (0.000013 per item) C:\test>junk -N=1000 Took 0.105176 seconds for 1000 items (0.000105 per item) C:\test>junk -N=10000 Took 11.326000 seconds for 10000 items (0.001133 per item) C:\test>junk -N=20000 Took 46.241000 seconds for 20000 items (0.002312 per item) C:\test>junk -N=30000 Took 105.498000 seconds for 30000 items (0.003517 per item) C:\test>junk -N=40000 Took 186.630000 seconds for 40000 items (0.004666 per item)

That's using M as 10% of N.

Which I project means over 24 hours for a million items and 4 days for 2 million.

I appreciate that doing an insertion sort using splice can be improved upon using (say) a heap, but most of the modules implementing alternatives to perl's built-in data structures, tend to be implemented using objects wrapped over hashes or arrays, and so what you gain from a somewhat more intelligent DS, you loose from the calling overheads :(


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Replies are listed 'Best First'.
Re^9: In-place sort with order assignment
by Limbic~Region (Chancellor) on Sep 20, 2010 at 01:16 UTC
    BrowserUk,
    I have no idea how much additional memory Heap::Simple::XS uses under the covers, but the speed is dramatically faster than using splice with a binary search (above).
    #!/usr/bin/perl use strict; use warnings; use Heap::Simple::XS; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my $heap = Heap::Simple::XS->new(order => "gt", elements => "Scalar", +max_count => $at_once); my ($cnt, $beg, %known) = ($at_once, [gettimeofday], ()); while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); } my $items = $heap->count; last if ! $items; %known = (); my $max = $cnt + $items; $known{$_} = $cnt-- for $heap->extract_all; $cnt = $max; $heap->clear; } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk2.pl 100 Took 0.001999 seconds for 100 items (0.0000200 per item) C:\tmp>perl buk2.pl 1000 Took 0.021015 seconds for 1000 items (0.0000210 per item) C:\tmp>perl buk2.pl 10000 Took 0.241327 seconds for 10000 items (0.0000241 per item) C:\tmp>perl buk2.pl 100000 Took 3.375 seconds for 100000 items (0.0000338 per item) C:\tmp>perl buk2.pl 1000000 Took 48.25 seconds for 1000000 items (0.0000483 per item)

    Cheers - L~R

      I have no idea how much additional memory Heap::Simple::XS uses under the covers,

      For 1e6 items, the memory usage grows from 145MB to over 200MB, which for 10e6 items is going to push a 32-bit machine into swapping.

      That said, I think this memory usage may, in part at least, be due to a bug in this incarnation of the code.

      I cannot see what would prevent this loop copying everything from %hash into both %known and the heap?

      while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); }

      Overall, the approach used in the second snippet in Re^2: In-place sort with order assignment seems to be the best. It takes 8 seconds and very little extra memory for 1e6; versus 50 seconds and +25% for the heap. And it happily handles 10e6 in 108 seconds and under 2GB.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        I cannot see what would prevent this loop copying everything from %hash into both %known and the heap?

        next if defined $val; will skip any keys from %hash that we have previously assigned a value to.

        $hash{$key} = $known{$key};next; will assign any values we learned from the last run and then move on to the next record.

        $heap->insert($key); will only insert records into the heap for keys that we have not assigned a value to (either in a previous run or this run). Update: According to the documentation, max_count => $at_once will throw out items from the heap beyond that point. If that doesn't work as advertised, that may be the source of the additional memory.

        Cheers - L~R

Re^9: In-place sort with order assignment
by Limbic~Region (Chancellor) on Sep 19, 2010 at 23:17 UTC
    BrowserUk,
    Perhaps the following minor modifications are more to your liking:
    #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my ($cnt, @bot_n, %known) = (0, (), ()); my $beg = [gettimeofday]; while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } my $inserted; my ($min, $max) = (0, $#bot_n); while (1) { my $mid = $min + (($max - $min) / 2); last if ! defined $bot_n[$mid]; $key gt $bot_n[$mid] ? ($min = $mid + 1) : ($max = $mid - +1); last if $min > $max; } splice @bot_n, $min, 0, $key; pop @bot_n if @bot_n > $at_once; } last if ! @bot_n; %known = (); $known{$_} = ++$cnt for @bot_n; @bot_n = (); } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk.pl 100 Took 0.003766 seconds for 100 items (0.0000377 per item) C:\tmp>perl buk.pl 1000 Took 0.054244 seconds for 1000 items (0.0000542 per item) C:\tmp>perl buk.pl 10000 Took 0.734375 seconds for 10000 items (0.0000734 per item) C:\tmp>perl buk.pl 100000 Took 12.515625 seconds for 100000 items (0.0001252 per item) C:\tmp>perl buk.pl 1000000 Took 394.375 seconds for 1000000 items (0.0003944 per item)

    Cheers - L~R

Re^9: In-place sort with order assignment
by Limbic~Region (Chancellor) on Sep 23, 2010 at 18:48 UTC
    BrowserUk,
    I have already shared this with you on my scratch but I would like to leave it here as well for posterity purposes. It turns out that even without the overhead of function calls, a pure perl heap still doesn't compete with the binary search and splice. As tye pointed out in the CB as I was lamenting, count opcodes. My guess is that perl has very optimized C under the covers using memcpy and what not making it superior to a home grown heap.

    I do want to point out that some of the complexity in the example below is because the heap has been special cased to this specific problem (limiting size). A general purpose implementation would be more powerful though less complex due to abstraction. This means it would also be slower.

    #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my %hash = construct_hash(); my ($cnt, $top_n, @heap, %known) = (0, int((keys %hash) * .10), (), () +); my $beg = [gettimeofday]; while (1) { my ($max_key, $max_pos) = ('', undef); while (my ($key, $val) = each %hash) { # Skip entries we already have assigned a value too next if defined $val; # If we know the value of this key, assign it and move one if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } # Record where the key is inserted my $new_pos; # Check to see if heap is full if (@heap == $top_n) { # Key is already bigger than the top-n most key next if $key gt $max_key; # Insert the value in the previous maximum position ($heap[$max_pos], $new_pos) = ($key, $max_pos); # Find new max key and pos $max_key = ''; my $beg = $max_pos % 2 ? (($max_pos - 1) / 2) : (($max_pos + - 2) / 2); for ($beg .. $#heap) { ($max_key, $max_pos) = ($heap[$_], $_) if $heap[$_] gt + $max_key; } } else { push @heap, $key; $new_pos = $#heap; ($max_key, $max_pos) = ($key, $#heap) if $key gt $max_key; } # Bubble up my $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_pos + - 2) / 2; while ($heap[$new_pos] lt $heap[$parent_pos] && $parent_pos > +-1) { $max_pos = $new_pos if $heap[$parent_pos] eq $max_key; ($heap[$new_pos], $heap[$parent_pos], $new_pos) = ($heap[$ +parent_pos], $heap[$new_pos], $parent_pos); $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_po +s - 2) / 2; } } last if ! @heap; %known = (); #@heap = sort @heap; #$known{$_} = ++$cnt for @heap; while (@heap) { my $key; if (@heap == 1 || @heap == 3) { $key = shift @heap; } elsif (@heap == 2) { $key = $heap[0] lt $heap[1] ? (shift @heap) : (pop @heap); } else { # Record the top item $key = $heap[0]; # Replace the root with the last item from the last level $heap[0] = pop @heap; # Bubble down my $cur_pos = 0; my $child_pos = $heap[1] lt $heap[2] ? 1 : 2; while ($heap[$cur_pos] gt $heap[$child_pos]) { ($heap[$cur_pos], $heap[$child_pos], $cur_pos) = ($hea +p[$child_pos], $heap[$cur_pos], $child_pos); my ($l_pos, $r_pos) = ((($cur_pos * 2) + 1), (($cur_po +s * 2) + 2)); last if $l_pos > $#heap; if ($r_pos > $#heap) { $child_pos = $l_pos; next; } $child_pos = $heap[$l_pos] lt $heap[$r_pos] ? $l_pos : + $r_pos; } } $known{$key} = ++$cnt; } } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $cnt); print "Took $elapsed seconds for $cnt items ($per per item)\n"; sub construct_hash { my $items = $ARGV[0] || 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; return %hash; }

    Cheers - L~R