Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to compute a median of a large number of numbers (10M+). Sorting takes me about 35s, but I want to be able to do it in at most 1/3 of that time. I know there are randomized algorithms and PDL has a median primitive, and there are other libraries on CPAN. What's the best way to go?

Replies are listed 'Best First'.
Re: need a faster median
by zentara (Cardinal) on Oct 18, 2011 at 17:39 UTC
    PDL is about as fast as C for number crunching, but then again, it is a big module to load. If you want the fastest speed possible, you should probably take the quickselect.c program discussed on fastest median possible and put it into an Inline::C routine.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: need a faster median
by salva (Canon) on Oct 18, 2011 at 20:23 UTC
    use Sort::Key::Top:
    use Sort::Key::Top 'natpos'; use Benchmark qw(cmpthese); my @f = map rand, 0..1e7; my ($r1, $r2); cmpthese -1, { sort => sub { $r1 = (sort { $a <=> $b } @f)[@f/2] }, skt => sub { $r2 = natpos @f/2, @f } }; print "median $r1, $r2\n";
    on my computer:
    $ perl /tmp/median.pl s/iter sort skt sort 20.5 -- -94% skt 1.25 1542% -- mediam 0.499785519518444, 0.499785519518444
Re: need a faster median
by Corion (Patriarch) on Oct 18, 2011 at 16:32 UTC

    I would assume that the "best" way to go really depends on what you really need, and by how much the calculated median is allowed to deviate from the real median.

    You do realize that to find the real median, you always have to look at at least 50% of all elements in your set unless you do some preparation. If you can pre-sort your elements or use a heap, or a good estimate where your median will likely land, you might get away with throwing out more elements. But all of these approaches have drawbacks and you haven't stated what drawbacks are acceptable and which ones are not.

      I have a non-sorted array of "floats" of size 10M+. I have no more assumptions.

        I found that on my laptop, I can find the mode of 10 million floats in 13 seconds. Just keep a count of each value encountered and select the largest value. There's no need to sort it.

        Update: I forgot to mention that a hash is a good place to keep your counts...

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.