Re^7: In-place sort with order assignment

Replies are listed 'Best First'.
Re^8: In-place sort with order assignment by BrowserUk (Patriarch) on Sep 19, 2010 at 19:19 UTC
Many thanks for the code because I had drawn a complete blank on the description. The problem with it is that (as currently implemented) it is quadratic in time: `C:\test>junk -N=100 Took 0.001339 seconds for 100 items (0.000013 per item) C:\test>junk -N=1000 Took 0.105176 seconds for 1000 items (0.000105 per item) C:\test>junk -N=10000 Took 11.326000 seconds for 10000 items (0.001133 per item) C:\test>junk -N=20000 Took 46.241000 seconds for 20000 items (0.002312 per item) C:\test>junk -N=30000 Took 105.498000 seconds for 30000 items (0.003517 per item) C:\test>junk -N=40000 Took 186.630000 seconds for 40000 items (0.004666 per item)` [download] That's using M as 10% of N. Which I project means over 24 hours for a million items and 4 days for 2 million. I appreciate that doing an insertion sort using splice can be improved upon using (say) a heap, but most of the modules implementing alternatives to perl's built-in data structures, tend to be implemented using objects wrapped over hashes or arrays, and so what you gain from a somewhat more intelligent DS, you loose from the calling overheads :( Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l]
Re^9: In-place sort with order assignment by Limbic~Region (Chancellor) on Sep 20, 2010 at 01:16 UTC
BrowserUk, I have no idea how much additional memory Heap::Simple::XS uses under the covers, but the speed is dramatically faster than using splice with a binary search (above). #!/usr/bin/perl use strict; use warnings; use Heap::Simple::XS; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] \|\| 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my $heap = Heap::Simple::XS->new(order => "gt", elements => "Scalar", +max_count => $at_once); my ($cnt, $beg, %known) = ($at_once, [gettimeofday], ()); while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); } my $items = $heap->count; last if ! $items; %known = (); my $max = $cnt + $items; $known{$_} = $cnt-- for $heap->extract_all; $cnt = $max; $heap->clear; } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk2.pl 100 Took 0.001999 seconds for 100 items (0.0000200 per item) C:\tmp>perl buk2.pl 1000 Took 0.021015 seconds for 1000 items (0.0000210 per item) C:\tmp>perl buk2.pl 10000 Took 0.241327 seconds for 10000 items (0.0000241 per item) C:\tmp>perl buk2.pl 100000 Took 3.375 seconds for 100000 items (0.0000338 per item) C:\tmp>perl buk2.pl 1000000 Took 48.25 seconds for 1000000 items (0.0000483 per item) [download] Cheers - L~R	[reply] [d/l]
Re^10: In-place sort with order assignment by BrowserUk (Patriarch) on Sep 20, 2010 at 07:40 UTC
I have no idea how much additional memory Heap::Simple::XS uses under the covers, For 1e6 items, the memory usage grows from 145MB to over 200MB, which for 10e6 items is going to push a 32-bit machine into swapping. That said, I think this memory usage may, in part at least, be due to a bug in this incarnation of the code. I cannot see what would prevent this loop copying everything from `%hash` into both `%known` and the heap? `while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } $heap->insert($key); }` [download] Overall, the approach used in the second snippet in Re^2: In-place sort with order assignment seems to be the best. It takes 8 seconds and very little extra memory for 1e6; versus 50 seconds and +25% for the heap. And it happily handles 10e6 in 108 seconds and under 2GB. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l] [select]
Re^11: In-place sort with order assignment by Limbic~Region (Chancellor) on Sep 20, 2010 at 13:18 UTC
Re^12: In-place sort with order assignment by BrowserUk (Patriarch) on Sep 20, 2010 at 13:39 UTC
Re^9: In-place sort with order assignment by Limbic~Region (Chancellor) on Sep 19, 2010 at 23:17 UTC
BrowserUk, Perhaps the following minor modifications are more to your liking: #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my $items = $ARGV[0] \|\| 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; my $at_once = int($items * .10); my ($cnt, @bot_n, %known) = (0, (), ()); my $beg = [gettimeofday]; while (1) { while (my ($key, $val) = each %hash) { next if defined $val; if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } my $inserted; my ($min, $max) = (0, $#bot_n); while (1) { my $mid = $min + (($max - $min) / 2); last if ! defined $bot_n[$mid]; $key gt $bot_n[$mid] ? ($min = $mid + 1) : ($max = $mid - +1); last if $min > $max; } splice @bot_n, $min, 0, $key; pop @bot_n if @bot_n > $at_once; } last if ! @bot_n; %known = (); $known{$_} = ++$cnt for @bot_n; @bot_n = (); } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $items); print "Took $elapsed seconds for $items items ($per per item)\n"; __DATA__ C:\tmp>perl buk.pl 100 Took 0.003766 seconds for 100 items (0.0000377 per item) C:\tmp>perl buk.pl 1000 Took 0.054244 seconds for 1000 items (0.0000542 per item) C:\tmp>perl buk.pl 10000 Took 0.734375 seconds for 10000 items (0.0000734 per item) C:\tmp>perl buk.pl 100000 Took 12.515625 seconds for 100000 items (0.0001252 per item) C:\tmp>perl buk.pl 1000000 Took 394.375 seconds for 1000000 items (0.0003944 per item) [download] Cheers - L~R	[reply] [d/l]
Re^9: In-place sort with order assignment by Limbic~Region (Chancellor) on Sep 23, 2010 at 18:48 UTC
BrowserUk, I have already shared this with you on my scratch but I would like to leave it here as well for posterity purposes. It turns out that even without the overhead of function calls, a pure perl heap still doesn't compete with the binary search and splice. As tye pointed out in the CB as I was lamenting, count opcodes. My guess is that perl has very optimized C under the covers using memcpy and what not making it superior to a home grown heap. I do want to point out that some of the complexity in the example below is because the heap has been special cased to this specific problem (limiting size). A general purpose implementation would be more powerful though less complex due to abstraction. This means it would also be slower. #!/usr/bin/perl use strict; use warnings; use Time::HiRes qw/gettimeofday tv_interval/; my %hash = construct_hash(); my ($cnt, $top_n, @heap, %known) = (0, int((keys %hash) * .10), (), () +); my $beg = [gettimeofday]; while (1) { my ($max_key, $max_pos) = ('', undef); while (my ($key, $val) = each %hash) { # Skip entries we already have assigned a value too next if defined $val; # If we know the value of this key, assign it and move one if (exists $known{$key}) { $hash{$key} = $known{$key}; next; } # Record where the key is inserted my $new_pos; # Check to see if heap is full if (@heap == $top_n) { # Key is already bigger than the top-n most key next if $key gt $max_key; # Insert the value in the previous maximum position ($heap[$max_pos], $new_pos) = ($key, $max_pos); # Find new max key and pos $max_key = ''; my $beg = $max_pos % 2 ? (($max_pos - 1) / 2) : (($max_pos + - 2) / 2); for ($beg .. $#heap) { ($max_key, $max_pos) = ($heap[$_], $_) if $heap[$_] gt + $max_key; } } else { push @heap, $key; $new_pos = $#heap; ($max_key, $max_pos) = ($key, $#heap) if $key gt $max_key; } # Bubble up my $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_pos + - 2) / 2; while ($heap[$new_pos] lt $heap[$parent_pos] && $parent_pos > +-1) { $max_pos = $new_pos if $heap[$parent_pos] eq $max_key; ($heap[$new_pos], $heap[$parent_pos], $new_pos) = ($heap[$ +parent_pos], $heap[$new_pos], $parent_pos); $parent_pos = $new_pos % 2 ? ($new_pos - 1) / 2 : ($new_po +s - 2) / 2; } } last if ! @heap; %known = (); #@heap = sort @heap; #$known{$_} = ++$cnt for @heap; while (@heap) { my $key; if (@heap == 1 \|\| @heap == 3) { $key = shift @heap; } elsif (@heap == 2) { $key = $heap[0] lt $heap[1] ? (shift @heap) : (pop @heap); } else { # Record the top item $key = $heap[0]; # Replace the root with the last item from the last level $heap[0] = pop @heap; # Bubble down my $cur_pos = 0; my $child_pos = $heap[1] lt $heap[2] ? 1 : 2; while ($heap[$cur_pos] gt $heap[$child_pos]) { ($heap[$cur_pos], $heap[$child_pos], $cur_pos) = ($hea +p[$child_pos], $heap[$cur_pos], $child_pos); my ($l_pos, $r_pos) = ((($cur_pos * 2) + 1), (($cur_po +s * 2) + 2)); last if $l_pos > $#heap; if ($r_pos > $#heap) { $child_pos = $l_pos; next; } $child_pos = $heap[$l_pos] lt $heap[$r_pos] ? $l_pos : + $r_pos; } } $known{$key} = ++$cnt; } } my $elapsed = tv_interval($beg, [gettimeofday]); my $per = sprintf("%.7f", $elapsed / $cnt); print "Took $elapsed seconds for $cnt items ($per per item)\n"; sub construct_hash { my $items = $ARGV[0] \|\| 100; my $str = 'a'; my %hash = map {$str++ => undef} 1 .. $items; return %hash; } [download] Cheers - L~R	[reply] [d/l]