in reply to Re: Re: noncase-sensitive sorting
in thread noncase-sensitive sorting

That's educational. Thanks for running the numbers, Roger. I've tried to make it a habit to keep the complexity of the code inside a sort code-block to a minimum for efficiency reasons. Your results seem conclusive for cases with uc or lc. I am sure that the Schwartzian method would win if the complexity inside the sort function's code block got more intense.

Another factor that has some bearing on the comparative efficiency of the Schwartzian Transform versus a flat sort with additional complexity inside the sort code block is the issue of best case and worst case scenarios for the original unsorted list. In other words, a worst-case order of the original list may benefit more from the transform, while a best-case order for the original list may suffer from the transform.

The reason is that the two map's have a "fixed cost", whereas the code that appears within the code block of a sort function has a variable cost dependant on the original list's sort order.

The moral: If it really matters, benchmark it. ;)


Dave

Replies are listed 'Best First'.
Re: Re: Re: Re: noncase-sensitive sorting
by Roger (Parson) on Dec 02, 2003 at 07:25 UTC
    Hi davido, I was too quick to say that the Schwartzian is slower than simple sort. I just did another benchmark, by generating 1000 longer strings of up to 1000 characters (I had shorter strings of up to 10 characters before). And when I ran the script again, I got quite different results.

    Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 36 wallclock secs (35.58 CPU) @ 28.11/s (n=1000) Simple: 149 wallclock secs (149.35 CPU) @ 6.70/s (n=1000) Rate Simple Schwartzian Simple 6.70/s -- -76% Schwartzian 28.1/s 320% --
    So uc and lc simple sort (points well taken from Abigail-II) do become less efficient when the length of the strings to compare are very long. So benchmark doesn't always tell the story. So shall I rephrase the efficiency remark as 'it depends'? :-)

      So uc and lc do become less efficient when the length of the strings to compare are very long.

      No, uc and lc don't become 'less efficient'. They just need more time because the strings are longer, while the extra lookup ST does is independent of the length of the string.

      Note that a GRT is even faster than an ST:

      #!/usr/bin/perl use strict; use warnings; use Benchmark qw/timethese cmpthese/; my @letters = ('A' .. 'Z', 'a' .. 'z'); my $size = 1_000; my $max_l = 1_000; my @array = map {join "" => map {$letters [rand @letters]} 1 .. 1 + ra +nd $max_l} 1 .. $size; my (@st_sorted, @simple_sorted, @grt_sorted); cmpthese -10 => { Schwartzian => \&sort_schwartzian, Simple => \&sort_simple, GRT => \&sort_grt, }; sub sort_schwartzian { @st_sorted = map { $_ -> [0] } sort { $a -> [1] cmp $b -> [1]} map {[$_ => uc $_ ] } @array; } sub sort_simple { @simple_sorted = sort {uc ($a) cmp uc ($b)} @array; } sub sort_grt { @grt_sorted = map {substr $_ => $max_l} sort map {uc () . ("\0" x ($max_l - length)) . $_} @array; } die "Unequal" unless "@st_sorted" eq "@simple_sorted" && "@st_sorted" eq "@grt_sorted"; __END__ Rate Simple Schwartzian GRT Simple 8.99/s -- -78% -81% Schwartzian 40.7/s 353% -- -15% GRT 48.2/s 436% 18% --

      Abigail

        Thanks for the correction. I have to admit that I am ignorent of the GRT sort. I stared at the GRT sort for a while, and finally worked out what it's doing after sticking a print inside the map. I am writing down my finding on what is happenning inside a GRT sort here for the possible benefit to other monks.

        @grt_sorted = map {substr $_ => $max_l} sort map {uc () . ("\0" x ($max_l - length)) . $_} @array;
        The first map expands the original string with a 0 terminated string of width $max_l, concatinated with the original string itself. Eliminates the use of a slice in the Schwartzian sort.

        The sort follows only sorts the first part of the string terminated at \0.

        The second map then retrieves the original string with substr.

        Very clever indeed. :-D

        The tricky bit is the first map. I modified it to the following to inspect what's happenning inside the map -
        map { my $c = uc () . ("\0" x ($max_l - length)) . $_; print "$_ => $c\n"; $c } @array;