A brief description of heaps
A heap can be thought of as a binary tree, but it's stored as an array (so it's very compact). The defining property of the heap is that at any subtree, the largest item in the tree is at the root, which makes a heap partially sorted. A full description of a heap is beyond the scope of this meditation, but for my purposes, the important properties of a heap are as follows.
Heaps are interesting for other reasons, and I'd encourage any monk to look into them further. They're especially good for a priority queue in which you're always interested in the biggest thing.
(Note that "heap" can also refer to an area of memory used for dynamic memory allocation.)
Half sorting
Here's my relatively efficient way to find the median of a set. (I say "my" way because I thought of it independently, though certainly not first.)
Compare this to a more mundane approach to medians: sort the entire set and then pull out the relevant value. By sorting only half the list, it should be possible to attain the median more efficiently than a full O(n log n) sort.
Testing the supposed efficiency gain
In the code below, I create a class for tied scalars which act like real scalars except that every FETCH() increments a counter. I feed lists of these scalars into median finding functions based on both sort and Heap::Simple and compare the number of accesses required by each algorithm. It's assumed that two accesses is one comparison performed by the algorithm. This way we can see which method does more comparisons.
use strict; use warnings; package Tie::Scalar::Count; use vars qw( $Count ); $Count = 0; sub TIESCALAR { bless \my $x, shift } sub STORE { ${ $_[0] } = $_[1] } sub FETCH { $Count++; return ${ $_[0] }; } package main; use Test::More 'no_plan'; use List::Util 'shuffle'; use Heap::Simple; sub scalars2counters { foreach my $m ( @_ ) { my $x = $m; tie $m, 'Tie::Scalar::Count'; $m = $x; } } sub median_sort { my @s = sort { $a <=> $b } @_; my $i = scalar @s / 2; return ( $i == int $i ) ? ( $s[ $i-1 ] + $s[ $i ] ) / 2 : $s[ int $i ]; } sub median_heap { my $heap; my $tied = !!(tied $_[0]); if ( $tied ) { $heap = Heap::Simple->new( elements => ['Method' => 'FETCH'] ) +; $heap->insert( map { tied $_ } @_ ); } else { $heap = Heap::Simple->new(); $heap->insert( @_ ); } my $i = scalar @_ / 2; for ( 0 .. int $i - 2 ) { $heap->extract_top; } if ( $i == int $i ) { my $v1 = $heap->extract_top; # $i - 1 my $v2 = $heap->extract_top; # $i if ( $tied ) { $v1 = $v1->FETCH(); $v2 = $v2->FETCH(); } return ( $v1 + $v2 ) / 2; } else { $heap->extract_top; # $i - 1 return $tied ? $heap->extract_top->FETCH() : $heap->extract_to +p; } } # Confirm that Tie::Scalar::Count works as advertised { tie my $foo, 'Tie::Scalar::Count'; is( $Tie::Scalar::Count::Count, 0, 'count is zero initially' ); $foo = 10; is( $Tie::Scalar::Count::Count, 0, 'count is zero after STORE' ); is( $foo, 10, 'FETCH works' ); is( $Tie::Scalar::Count::Count, 1, 'count is 1 after FETCH' ); } # Confirm that median_sort and median_heap get the right medians { my @odd = ( 1, 2, 3 ); my @even = ( 2, 4 ); is( median_sort( @odd ), 2, 'odd median_sort is correct' ); is( median_sort( @even ), 3, 'even median_sort is correct' ); TODO: { local $TODO = 'heap method not ready for a short list'; is( median_heap( @odd ), 2, 'odd median_heap is correct' ); } is( median_heap( @even ), 3, 'even median_heap is correct' ); my @bonus1 = shuffle 0 .. 100; my @bonus2 = shuffle 1 .. 100; is( median_heap( @bonus1 ), median_sort( @bonus1 ), 'larger odd list is correct' ); is( median_heap( @bonus2 ), median_sort( @bonus2 ), 'larger even list is correct' ); } # Check that scalars2counters works { my @scalars = 0 .. 10; scalars2counters( @scalars ); my $tied = grep { tied $_ } @scalars; is( $tied, scalar @scalars, 'all elements tied' ); } # Confirm that counting works when using median_sort { $Tie::Scalar::Count::Count = 0; my @m = shuffle 0 .. 10; scalars2counters( @m ); is( median_sort( @m ), 5, 'median_sort is correct' ); ok( $Tie::Scalar::Count::Count > scalar @m, "count ($Tie::Scalar::Count::Count) is > @{[ scalar @m ]}" ); } # Confirm that counting works when using median_heap { $Tie::Scalar::Count::Count = 0; my @m = shuffle 0 .. 10; scalars2counters( @m ); is( median_heap( @m ), 5, 'median_heap is correct' ); ok( $Tie::Scalar::Count::Count > scalar @m, "count ($Tie::Scalar::Count::Count) is > @{[ scalar @m ]}" ); } { $Tie::Scalar::Count::Count = 0; my @m = shuffle 0 .. 100; scalars2counters( @m ); my $heap_median = median_heap( @m ); my $heap_compares = $Tie::Scalar::Count::Count; $Tie::Scalar::Count::Count = 0; my $sort_median = median_sort( @m ); my $sort_compares = $Tie::Scalar::Count::Count; is( $heap_median, $sort_median, 'Heap answer same as sort' ); ok( $heap_compares < $sort_compares, "fewer heap compares ($heap_compares < $sort_compares)" ); }
Efficiency is not speed
The heap method wins. In my tests, it typically does the job with about 65% of the comparisons required by a full sort. This seems like a big win, but there's a catch.
Using Benchmark on these functions, the sort-based median comes out about three times faster for a list with millions of elements. The efficiency advantage of the heap can't beat the optimization advantage of sort. In my testing, I ran out of memory before I found a list long enough for the heap's efficiency to win out.
Anecdote
Years ago, I converted some sort-based median-finding code to use a heap instead. That was in C++, so the different implementations were on more of an equal footing. In arrays of two million floats, I got a performance gain, but it was small enough that it could have been a measurement error (I was using a stop watch).
Implementation matters
I tried this first with Heap::Binary instead of Heap::Simple, and it was actually less efficient than a simple sort. I never looked into why, but clearly not all heaps are created equal.
The heap method could probably be made faster if it were tailor made to the task at hand. Using the CPAN module, I was actually removing elements on the way to the median, but it's not necessary to do this. A custom implementation could walk down the array without having to shorten it. The custom implementation also wouldn't have the baggage of supporting the tied scalars I was using for testing. Finally, and perhaps most importantly, building a heap from unordered data can actually be done faster than O(n log n) using Tarjan's algorithm (I didn't know this before I started looking around Wikipedia for this writeup).
Last thoughts (for lack of a conclusion)
I wrote this up mostly as (what I hope is) an interesting way to look at a common problem. Because sort is so much faster, this won't be of much use to Perl programmers.
I also find this an interesting method to look at the efficiency of a piece of code over its speed. The code in question can stay a black box, and we can still get some idea of how much work it does on the data involved. Most of the time we'll be more interested in speed than efficiency, but the latter may be worthwhile for estimating how an algorithm will scale.
|
---|