After a discussion in the chatterbox, with input from tye and Fastolfe, I tried a few different ways of solving a certain kind of sorting problem.

I've got a number of programs that output alphabetically sorted lists, but often there's an "other" category that we want to list after all the "real" elements in the list.

I had a solution that worked, but it looks really ugly, so I tried out two other suggestions and benchmarked those suckers.

#!/usr/bin/perl -w use strict; use Benchmark; # Yeah, I'm using globals; this is to get around troubles # I was having since Benchmark uses eval ... use vars qw/@array %last/; # some values to put in there : mix cases to avoid "asciibetical" sort @array=qw(xavier colin melissa wally Edward joan Arlen George mohandas + ralph other); $last{other}=1; sub using_hash { defined($last{lc($a)}) cmp defined($last{lc($b)}) or lc($a) cmp lc +($b); } sub using_return { return 1 if lc($a) eq 'other'; return -1 if lc($b) eq 'other'; lc($a) cmp lc($b); } sub using_eq { ( (lc($a) eq 'other') cmp (lc($b) eq 'other') ) || lc($a) cmp l +c($b); } # for gosh sakes, use a sensible value here # I needed a large number on a 4 CPU system timethese (100000, {'use_hash'=> q{ my @sorted = sort using_hash @array; }, use_return=> q{ my @sorted = sort using_return @array }, use_eq=> q{ my @sorted = sort using_eq @array } } );

The results. Surprising (to me) is that my ugly using_return benchmarked the fastest (when I added lc(); going with "ASCIIbetical" sorting made it marginally slower than using_eq) using_hash is nice because it makes it easy to change the value you want placed last on the fly, but it came in third in the speed category. Not that it's *that much* slower, however.

use_eq: 24 wallclock secs (24.27 usr + 0.00 sys = 24.27 CPU) use_hash: 27 wallclock secs (27.68 usr + 0.00 sys = 27.68 CPU) use_return: 23 wallclock secs (22.97 usr + 0.00 sys = 22.97 CPU) $ ./sort_test.pl Benchmark: timing 100000 iterations of use_eq, use_hash, use_return .. use_eq: 24 wallclock secs (23.59 usr + 0.00 sys = 23.59 CPU) use_hash: 28 wallclock secs (27.75 usr + 0.00 sys = 27.75 CPU) use_return: 24 wallclock secs (22.86 usr + 0.00 sys = 22.86 CPU)

The differences *are* pretty minor, but I'd like to try them with larger lists, and also after thinking more about optimization.

Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Replies are listed 'Best First'.
RE: A discriminating sort (and some damned lies)
by Fastolfe (Vicar) on Nov 03, 2000 at 04:29 UTC
    I consider the differences between 'use_eq' and 'use_return' to be inconsequential. The hash version can be explained by the hash lookups required. This version also lets you use an arbitrary number of 'last' items, though. I would also use exists over defined.

      Yeah, obviously the difference isn't huge by any measure. The return s perhaps offer that minimal speed gain by potentially decreasing the number of operations in the sorting sub.

      Re: exists vs. defined. Clearly these test for different things (for onlookers, whether a key exists in a hash, and whether the value in the hash corresponding to the key has been defined), and there are obviously contexts in which it would be an error to confuse them. But if you keep a tight rein on your hash (as was done in the code above), there's no functional difference here : $last{"not other"} fed to either function is going to return a false value. I suppose that forcing you to keep a tight rein on your hash could be your worry, but if your code (not Fastolfe in particular, of course!) is creating hash keys that you're not aware of, you're probably in trouble anyway =)

      Or am I missing something? (does, e.g. calling defined $hash{$key} create an entry for $key in $hash)?

      Philosophy can be made out of anything. Or less -- Jerry A. Fodor

        Calling defined for a hash key doesn't create anything.

        There is a subtle difference internally between exists and defined. With exists, all Perl has to do is see if there's an entry in the hash table for that item. With defined, that value has to be inspected:

        Benchmark: timing 1000000 iterations of defined, exists... defined: 21 wallclock secs (21.36 usr + 0.00 sys = 21.36 CPU) exists: 19 wallclock secs (18.45 usr + 0.00 sys = 18.45 CPU)
        Using code like this:
        # %hash is a hash with values for one/two/tre/for sub t_exists { 1 if exists $hash{one}; 1 if exists $hash{two}; 1 if exists $hash{tre}; 1 if exists $hash{for}; 1 if exists $hash{xxx}; 1 if exists $hash{xxy}; 1 if exists $hash{xxz}; 1 if exists $hash{xyx}; } ...
        Granted, that's a million iterations, and the difference is only a couple of seconds. Hardly worth arguing over, but it's just a difference in the way of thinking about hashes.. In my head, what I'm looking for is the presence of a certain key in the hash (exists), not whether or not the value for that key in the hash is defined. You're right that functionally they're equivalent in this case.