in reply to Re: noncase-sensitive sorting
in thread noncase-sensitive sorting

Hi davido, I have to disagree with you on the efficiency here, because I believe a simple sort is faster here. ;-)

I quickly wrote the following example to test the efficiency of the two different sort.

use strict; use Data::Dumper; use Benchmark qw/timethese cmpthese/; my @letters = 'A'..'Z'; my @array; # generate array of 1000 random length random upper/lower # case strings to feed into the test subs. foreach (0..1000) { my $c = $letters [rand 26]; $c = int(rand 2) ? uc $c : lc $c; push @array, $c x (1 + rand 10); } cmpthese( timethese(1000, { 'Schwartzian' => '&sort_schwartzian', 'Simple' => '&sort_simple', }) ); sub sort_schwartzian { my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, uc $_ ] } @array; } sub sort_simple { my @sorted = sort { uc($a) cmp uc($b) } @array; }
And the output is -
Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 16 wallclock secs (16.39 CPU) @ 61.01/s (n=1000) Simple: 11 wallclock secs (10.95 CPU) @ 91.32/s (n=1000) Rate Schwartzian Simple Schwartzian 61.0/s -- -33% Simple 91.3/s 50% --
It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has. And it demonstrates that the perl built-in uc and lc functions are quite efficient. %^p

Replies are listed 'Best First'.
Re: Re: Re: noncase-sensitive sorting
by sauoq (Abbot) on Dec 02, 2003 at 11:08 UTC
    It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has.

    Well, it shows that's the case with very little (and very simple) data. Instead of using little strings of a single character repeated up to ten times, try it with, say, 4k chunks of random character data. (I.e. use somthing like this to generate your @array:

    foreach (0..1000) { my $string; $string .= ('a'..'z','A'..'Z')[rand 52] for 1 .. 4096; push @array, $string; }
    With that change, I got these numbers:
    Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 67 wallclock secs (66.85 usr + 0.04 sys = 66.89 CPU) @ 1 +4.95/s (n= 1000) Simple: 534 wallclock secs (533.43 usr + 0.06 sys = 533.49 CPU) @ + 1.87/s ( n=1000) Rate Simple Schwartzian Simple 1.87/s -- -87% Schwartzian 14.9/s 698% --
    Does that show that the Scwartzian transform should definitely be used because it is 7 times as fast as without it? No. It shows, together with your results, that whether or not you choose to use an ST is (or should be) dependent both on the complexity of your sorting code and on the data you will be working with. It isn't black-or-white by any means.

    It also shows that both lc and uc are O(N). Try the whole exercise again but sort by length and you'll probably find that there is no point to an ST at all. (Perl's length() is O(1).)

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re: Re: noncase-sensitive sorting
by davido (Cardinal) on Dec 02, 2003 at 07:16 UTC
    That's educational. Thanks for running the numbers, Roger. I've tried to make it a habit to keep the complexity of the code inside a sort code-block to a minimum for efficiency reasons. Your results seem conclusive for cases with uc or lc. I am sure that the Schwartzian method would win if the complexity inside the sort function's code block got more intense.

    Another factor that has some bearing on the comparative efficiency of the Schwartzian Transform versus a flat sort with additional complexity inside the sort code block is the issue of best case and worst case scenarios for the original unsorted list. In other words, a worst-case order of the original list may benefit more from the transform, while a best-case order for the original list may suffer from the transform.

    The reason is that the two map's have a "fixed cost", whereas the code that appears within the code block of a sort function has a variable cost dependant on the original list's sort order.

    The moral: If it really matters, benchmark it. ;)


    Dave

      Hi davido, I was too quick to say that the Schwartzian is slower than simple sort. I just did another benchmark, by generating 1000 longer strings of up to 1000 characters (I had shorter strings of up to 10 characters before). And when I ran the script again, I got quite different results.

      Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 36 wallclock secs (35.58 CPU) @ 28.11/s (n=1000) Simple: 149 wallclock secs (149.35 CPU) @ 6.70/s (n=1000) Rate Simple Schwartzian Simple 6.70/s -- -76% Schwartzian 28.1/s 320% --
      So uc and lc simple sort (points well taken from Abigail-II) do become less efficient when the length of the strings to compare are very long. So benchmark doesn't always tell the story. So shall I rephrase the efficiency remark as 'it depends'? :-)

        So uc and lc do become less efficient when the length of the strings to compare are very long.

        No, uc and lc don't become 'less efficient'. They just need more time because the strings are longer, while the extra lookup ST does is independent of the length of the string.

        Note that a GRT is even faster than an ST:

        #!/usr/bin/perl use strict; use warnings; use Benchmark qw/timethese cmpthese/; my @letters = ('A' .. 'Z', 'a' .. 'z'); my $size = 1_000; my $max_l = 1_000; my @array = map {join "" => map {$letters [rand @letters]} 1 .. 1 + ra +nd $max_l} 1 .. $size; my (@st_sorted, @simple_sorted, @grt_sorted); cmpthese -10 => { Schwartzian => \&sort_schwartzian, Simple => \&sort_simple, GRT => \&sort_grt, }; sub sort_schwartzian { @st_sorted = map { $_ -> [0] } sort { $a -> [1] cmp $b -> [1]} map {[$_ => uc $_ ] } @array; } sub sort_simple { @simple_sorted = sort {uc ($a) cmp uc ($b)} @array; } sub sort_grt { @grt_sorted = map {substr $_ => $max_l} sort map {uc () . ("\0" x ($max_l - length)) . $_} @array; } die "Unequal" unless "@st_sorted" eq "@simple_sorted" && "@st_sorted" eq "@grt_sorted"; __END__ Rate Simple Schwartzian GRT Simple 8.99/s -- -78% -81% Schwartzian 40.7/s 353% -- -15% GRT 48.2/s 436% 18% --

        Abigail