in reply to noncase-sensitive sorting

You can put the uc or lc inside the code block of the sort function as Roger has demonstrated, and that's a perfectly legitimate solution. But there's a more efficient way, one that doesn't require a fresh uc or lc call for every comparison the sort makes. That more efficient way is known as the Schwartzian Transform:

my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [ $_, uc $_ ] }
             @unsorted;

The reason this method is often preferable is that the uc function is only called once per item being sorted, rather than every time that item is compared. Yes, there is some cost associated with the two map transforms, but that will usually be less than the cost of running 'expensive' routines inside the comparison function over and over.
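If it's easier to read broken out into its three steps, the same idea looks like this (just an illustrative rewrite of the line above; the intermediate variable names @decorated and @sorted_pairs are mine):

# 1. Decorate: pair each string with a precomputed, uppercased sort key.
my @decorated = map { [ $_, uc $_ ] } @unsorted;

# 2. Sort: compare only the precomputed keys.
my @sorted_pairs = sort { $a->[1] cmp $b->[1] } @decorated;

# 3. Undecorate: discard the keys, keeping the original strings in order.
my @sorted = map { $_->[0] } @sorted_pairs;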


Dave

Re: Re: noncase-sensitive sorting
by Roger (Parson) on Dec 02, 2003 at 07:07 UTC
    Hi davido, I have to disagree with you on the efficiency, because I believe a simple sort is faster here. ;-)

    I quickly wrote the following example to test the efficiency of the two different sorts.

    use strict;
    use Data::Dumper;
    use Benchmark qw/timethese cmpthese/;

    my @letters = 'A'..'Z';
    my @array;

    # generate array of 1000 random length random upper/lower
    # case strings to feed into the test subs.
    foreach (0..1000) {
        my $c = $letters[rand 26];
        $c = int(rand 2) ? uc $c : lc $c;
        push @array, $c x (1 + rand 10);
    }

    cmpthese( timethese(1000, {
        'Schwartzian' => '&sort_schwartzian',
        'Simple'      => '&sort_simple',
    }) );

    sub sort_schwartzian {
        my @sorted = map  { $_->[0] }
                     sort { $a->[1] cmp $b->[1] }
                     map  { [ $_, uc $_ ] }
                     @array;
    }

    sub sort_simple {
        my @sorted = sort { uc($a) cmp uc($b) } @array;
    }
    And the output is -
    Benchmark: timing 1000 iterations of Schwartzian, Simple...
    Schwartzian: 16 wallclock secs (16.39 CPU) @ 61.01/s (n=1000)
         Simple: 11 wallclock secs (10.95 CPU) @ 91.32/s (n=1000)
                   Rate Schwartzian Simple
    Schwartzian 61.0/s          --   -33%
    Simple      91.3/s         50%     --
    It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has. And it demonstrates that Perl's built-in uc and lc functions are quite efficient.

      It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has.

      Well, it shows that's the case with very little (and very simple) data. Instead of using little strings of a single character repeated up to ten times, try it with, say, 4k chunks of random character data. I.e., use something like this to generate your @array:

      foreach (0..1000) {
          my $string;
          $string .= ('a'..'z','A'..'Z')[rand 52] for 1 .. 4096;
          push @array, $string;
      }
      With that change, I got these numbers:
      Benchmark: timing 1000 iterations of Schwartzian, Simple...
      Schwartzian: 67 wallclock secs (66.85 usr + 0.04 sys = 66.89 CPU) @ 14.95/s (n=1000)
           Simple: 534 wallclock secs (533.43 usr + 0.06 sys = 533.49 CPU) @ 1.87/s (n=1000)
                    Rate      Simple Schwartzian
      Simple      1.87/s          --        -87%
      Schwartzian 14.9/s        698%          --
      Does that show that the Schwartzian transform should definitely be used because it is 7 times as fast as without it? No. It shows, together with your results, that whether or not you choose to use an ST is (or should be) dependent both on the complexity of your sorting code and on the data you will be working with. It isn't black-or-white by any means.

      It also shows that both lc and uc are O(N). Try the whole exercise again but sort by length and you'll probably find that there is no point to an ST at all. (Perl's length() is O(1).)
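      For example (just a sketch along the same lines, reusing the @array from the benchmark above and not timed here), sorting by length with and without the transform would look like:

      # Plain sort: length() is O(1), so calling it in the comparison is cheap.
      my @by_length = sort { length($a) <=> length($b) } @array;

      # Schwartzian version: the decorate/undecorate passes probably cost more
      # than the length() calls they save.
      my @by_length_st = map  { $_->[0] }
                         sort { $a->[1] <=> $b->[1] }
                         map  { [ $_, length $_ ] }
                         @array;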

      -sauoq
      "My two cents aren't worth a dime.";
      
      That's educational. Thanks for running the numbers, Roger. I've tried to make it a habit to keep the complexity of the code inside a sort code-block to a minimum for efficiency reasons. Your results seem conclusive for cases with uc or lc. I am sure that the Schwartzian method would win if the complexity inside the sort function's code block got more intense.

      Another factor that has some bearing on the comparative efficiency of the Schwartzian Transform versus a flat sort with additional complexity inside the sort code block is the issue of best-case and worst-case orderings of the original unsorted list. In other words, a worst-case order of the original list may benefit more from the transform, while a best-case order may actually suffer from it.

      The reason is that the two maps have a "fixed cost", whereas the code inside a sort function's code block has a variable cost that depends on the original list's sort order (that is, on how many comparisons the sort ends up making).
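      If you wanted to measure that effect, something along these lines would do it (an untested sketch; the array names and string sizes are just for illustration):

      use strict;
      use warnings;
      use Benchmark qw/cmpthese/;

      # Random test strings, plus a copy that is already in sorted order.
      my @random    = map { join '', map { ('a'..'z','A'..'Z')[rand 52] } 1 .. 200 } 1 .. 1000;
      my @presorted = sort { uc($a) cmp uc($b) } @random;

      cmpthese( -3, {
          'plain_presorted' => sub { my @s = sort { uc($a) cmp uc($b) } @presorted },
          'plain_random'    => sub { my @s = sort { uc($a) cmp uc($b) } @random },
          'st_presorted'    => sub {
              my @s = map  { $_->[0] }
                      sort { $a->[1] cmp $b->[1] }
                      map  { [ $_, uc $_ ] } @presorted;
          },
          'st_random'       => sub {
              my @s = map  { $_->[0] }
                      sort { $a->[1] cmp $b->[1] }
                      map  { [ $_, uc $_ ] } @random;
          },
      } );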

      The moral: If it really matters, benchmark it. ;)


      Dave

        Hi davido, I was too quick to say that the Schwartzian is slower than the simple sort. I just did another benchmark, generating 1000 longer strings of up to 1000 characters (I had shorter strings of up to 10 characters before). When I ran the script again, I got quite different results.
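        The change to the generation loop was roughly this (a sketch, not the exact code):

        # same idea as before, but strings now run up to 1000 characters
        foreach (0..1000) {
            my $c = $letters[rand 26];
            $c = int(rand 2) ? uc $c : lc $c;
            push @array, $c x (1 + rand 1000);
        }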

        Benchmark: timing 1000 iterations of Schwartzian, Simple...
        Schwartzian: 36 wallclock secs (35.58 CPU) @ 28.11/s (n=1000)
             Simple: 149 wallclock secs (149.35 CPU) @ 6.70/s (n=1000)
                      Rate      Simple Schwartzian
        Simple      6.70/s          --        -76%
        Schwartzian 28.1/s        320%          --
        So the simple sort with uc and lc (points well taken from Abigail-II) does become less efficient when the strings being compared are very long, and a single benchmark doesn't always tell the whole story. Shall I rephrase the efficiency remark as 'it depends'? :-)