Re: noncase-sensitive sorting

So uc and lc do become less efficient when the length of the strings to compare are very long.

No, uc and lc don't become 'less efficient'. They just need more time because the strings are longer, while the extra lookup ST does is independent of the length of the string.

Note that a GRT is even faster than an ST:

#!/usr/bin/perl
                                                           
use strict;
use warnings;

use Benchmark qw/timethese cmpthese/;
   
my @letters = ('A' .. 'Z', 'a' .. 'z');
   
my $size  = 1_000;
my $max_l = 1_000;

my @array = map {join "" => map {$letters [rand @letters]} 1 .. 1 + ra
+nd $max_l}
                                                           1 .. $size;
        
my (@st_sorted, @simple_sorted, @grt_sorted);
 
cmpthese -10 => {
   Schwartzian  => \&sort_schwartzian,
   Simple       => \&sort_simple,
   GRT          => \&sort_grt,
};

sub sort_schwartzian {
    @st_sorted =
        map  { $_ -> [0] }
        sort { $a -> [1] cmp $b -> [1]}
        map  {[$_ => uc $_ ] } @array; 
}  

sub sort_simple { 
    @simple_sorted = sort {uc ($a) cmp uc ($b)} @array;
}
                                                           
sub sort_grt {
    @grt_sorted =
        map {substr $_ => $max_l}
        sort
        map {uc () . ("\0" x ($max_l - length)) . $_} @array;
}
   
die "Unequal" unless "@st_sorted" eq "@simple_sorted" &&
                     "@st_sorted" eq "@grt_sorted";
__END__
              Rate      Simple Schwartzian         GRT
Simple      8.99/s          --        -78%        -81%
Schwartzian 40.7/s        353%          --        -15%
GRT         48.2/s        436%         18%          --
[download]

Abigail

Comment on Re: noncase-sensitive sorting Download Code

Replies are listed 'Best First'.
Re: Re: noncase-sensitive sorting by Roger (Parson) on Dec 02, 2003 at 11:34 UTC
Thanks for the correction. I have to admit that I am ignorent of the GRT sort. I stared at the GRT sort for a while, and finally worked out what it's doing after sticking a print inside the map. I am writing down my finding on what is happenning inside a GRT sort here for the possible benefit to other monks. `@grt_sorted = map {substr $_ => $max_l} sort map {uc () . ("\0" x ($max_l - length)) . $_} @array;` [download] The first map expands the original string with a 0 terminated string of width $max_l, concatinated with the original string itself. Eliminates the use of a slice in the Schwartzian sort. The sort follows only sorts the first part of the string terminated at \0. The second map then retrieves the original string with substr. Very clever indeed. :-D The tricky bit is the first map. I modified it to the following to inspect what's happenning inside the map - `map { my $c = uc () . ("\0" x ($max_l - length)) . $_; print "$_ => $c\n"; $c } @array;` [download]	[reply] [d/l] [select]
Re: noncase-sensitive sorting by Abigail-II (Bishop) on Dec 02, 2003 at 12:46 UTC
The sort follows only sorts the first part of the string terminated at \0. Eh, no, it doesn't (unless there's a bug ;-)). The sort compares the entire strings (well, up to the point they are different). The trick is that the original strings are prefixed (with a fixed length string) in such away that the prefix itself is enough to determine how to sort, unless the strings are equal (). Strings shorter than the maximum length string are padded with NUL characters because NULs sort before any other character, and normally 'a' is sorted before 'ab'. In fact, a GRT and a ST are doing to same trick, but where ST uses a 2 element array, one element the original value, the other the computed value, GRT combines both values into a single string. () Well, not quite. If a string would end in a NUL (which doesn't happen in the benchmark), it would give the same prefix as a similar string without the trailing NUL. Say for instance the max length is 4, and we have two strings, `"ab"` and `"ab\0"`. The both will have a `"ab\0\0"` prefix, but the strings used in the sort will differ: `"ab\0\0ab"` vs. `"ab\0\0ab\0"`. In this case, the `cmp` will reach the end of the entire string, even while the strings differ. It'll go beyond the prefix. Abigail	[reply]