in reply to Re: Re: *Fastest* way to print a hash sorted by value
in thread *Fastest* way to print a hash sorted by value

Feel free to "optimize" it.

He's gonna need some optimisations, or War & Peace. :)

The problem with this method is that you have to re-sort the keep array every time you add a new key.

If you are keeping the top 100 elements from 10_000_000, you are going to be sorting the 100-element array 9_999_901 times. Even with the existing 100 elements already sorted, and despite 5.8's default mergesort and its ability to take advantage of pre-existing order, adding a single value to a pre-sorted array of one hundred elements still seems to take an average of around 120 calls to the comparator:

    use sort '';
    print sort::current;                               # mergesort

    @array = 1 .. 99;

    $c = 0; @b = sort { $c++; $a <=> $b } @array, 0;   print $c;  # 113
    $c = 0; @b = sort { $c++; $a <=> $b } @array, 100; print $c;  # 99
    $c = 0; @b = sort { $c++; $a <=> $b } @array, 50;  print $c;  # 127
    $c = 0; @b = sort { $c++; $a <=> $b } @array, 49;  print $c;  # 127
    $c = 0; @b = sort { $c++; $a <=> $b } @array, 75;  print $c;  # 127
    $c = 0; @b = sort { $c++; $a <=> $b } @array, 12;  print $c;  # 122

I think this falls into tye's second category.

Using a binary chop to locate the insertion point takes at most 8 comparisons and an average of 5. Add a couple of pre-comparisons to skip the chop when the element to be added is less than the smallest seen so far, or greater than the largest, and you can avoid the binary chop altogether much of the time.
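
A minimal sketch of such a chop in pure Perl (the names are illustrative, and @keep is assumed to be held sorted ascending):

    # Locate the insertion point in a sorted-ascending @keep,
    # with pre-comparisons to skip the chop at either extreme.
    sub insert_pos {
        my( $keep, $new ) = @_;
        return 0             if $new <= $keep->[  0 ];   # below the smallest so far
        return scalar @$keep if $new >= $keep->[ -1 ];   # above the largest so far

        my( $lo, $hi ) = ( 0, $#$keep );
        while( $lo < $hi ) {                             # the binary chop itself
            my $mid = int( ( $lo + $hi ) / 2 );
            if( $keep->[ $mid ] < $new ) { $lo = $mid + 1 }
            else                         { $hi = $mid     }
        }
        return $lo;
    }

    # Splice the new value in and drop the smallest to hold the size at 100.
    my @keep = ( 1 .. 100 );
    splice @keep, insert_pos( \@keep, 42.5 ), 0, 42.5;
    shift @keep;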


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Re: Re: Re: *Fastest* way to print a hash sorted by value
by bart (Canon) on Aug 09, 2003 at 22:31 UTC
    He's gonna need some optimisations, or War & Peace. :)
    He must be a quick reader then.

    I timed the above program keeping 100 items, which I think is reasonable. "AA" .. "ZZ" returns 676 items, so once the keep array is full that leaves about 576 iterations that may use sort, which only actually happens if there's a new value in the current top 100. Eventually that should become quite rare. The program takes about 1/6th of a second on my system, a humble 500MHz Pentium. Doing the same for 1 million records will thus take about 6 minutes.
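
    For concreteness, the loop I timed is shaped roughly like this sketch (not the verbatim code; the hash contents are illustrative, and the keep array holds keys, so the comparator indirects through the hash):

        my %hash = map { $_ => rand } 'AA' .. 'ZZ';
        my @keep;    # keys of the current top 100, sorted descending by value
        for my $key ( keys %hash ) {
            if ( @keep < 100 ) {
                push @keep, $key;
                @keep = sort { $hash{$b} <=> $hash{$a} } @keep;
            }
            elsif ( $hash{$key} > $hash{ $keep[-1] } ) {   # a new top-100 value
                $keep[-1] = $key;
                @keep = sort { $hash{$b} <=> $hash{$a} } @keep;
            }
        }
        print "$_ => $hash{$_}\n" for @keep;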

    p.s. The loop without the sort takes 50ms. Thus, the sort takes about 2.4 times as long as the loop itself, or 70% of the total running time.

      You're right, Bart.

      When I did my testing, for simplicity I used an array rather than a hash. Or rather, three arrays:

      1. One with the values in random order.
      2. One with the values pre-sorted ascending.
      3. One with the values pre-sorted descending.

      This was to get a feel for average and worst-case performance.

      The numbers of sorts and comparisons I gave above were for the worst-case scenario, where the 1_000_000 values are pre-ordered ascending and every new value requires the keep array to be re-sorted. This is slow, as I indicated.

      The bit I missed with your code was that, as the values are coming from a hash, they are processed in "hash order", which of course is undefined, but does a pretty good job of approximating random order.

      This means that the keep array only ends up getting sorted an average of around 900 times, instead of the worst case of 999_900 times in the 100-from-1_000_000 case, with the obvious benefit to performance.
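
      That figure is easy to approximate with a quick sketch (random values standing in for the hash here, rather than my actual test arrays):

          my( $sorts, @keep ) = ( 0 );
          for ( 1 .. 1_000_000 ) {
              my $v = rand;
              if ( @keep < 100 ) {
                  @keep = sort { $b <=> $a } @keep, $v;   # still filling; not counted
              }
              elsif ( $v > $keep[-1] ) {
                  $keep[-1] = $v;                         # displace the smallest
                  @keep = sort { $b <=> $a } @keep;
                  ++$sorts;
              }
          }
          print "$sorts re-sorts\n";   # roughly 100 * log( 10_000 ) =~ 920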

      Whilst the worst-case scenario still exists, the chances of it happening naturally are astronomically small and can therefore be discounted. A quick test using your code on 'AAAAA' .. 'BAAAA' (913_952 keys) used 908 sorts and took around 45 seconds on my 233MHz machine. This compares pretty well with my code taking 15 seconds on a 1_000_000-element array, given that mine doesn't have the extra level of indirection in the comparator that it would need to retain the keys.

      I'd like to see a comparison of the two run on 10 million records, as I don't think they would scale linearly, but I agree that it makes the benefits of my algorithm look much less compelling. At least while it is coded in pure Perl.

      An XS or C implementation of bsearch() would improve things dramatically. Maybe I'll have a go. I think that it would be a useful addition to List::Util. What d'ya think?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.