texuser74 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Friends

this script sorts an ascii text file.

@array=<IN>; close (IN); @sortedarray=sort(@array); foreach $line (@sortedarray){ print OUT $line; }

OUTPUT:
AAA
CCC
ZZZ
aaa
bbb

but i want a noncase sensitive sorted output i.e.
AAA
aaa
bbb
CCC
ZZZ

Replies are listed 'Best First'.
Re: noncase-sensitive sorting
by Roger (Parson) on Dec 02, 2003 at 06:41 UTC
    You will have to read the documentation of sort carefully -
    perldoc -f sort
    Somewhere in the doc it says -
    # now case-insensitively @articles = sort {uc($a) cmp uc($b)} @files;
    I think you are getting lazy, texuser74. ;-)

      Thanks Roger

      in a manual i read the following commad, sort {uc($a) cmp uc($b)}(keys %IN)

      but really i don't understand about that keys, thatz why i posted here

      once again thanks for your help

        There's lots of discussion below that is interesting, but I'm guessing it's also pretty deep for (what seems to me) a relative "perl newbie" poster ...

        Efficiency aside (and in the spirit of 'how the hell do I get this to work'), why not use:

        @files = sort { lc $a cmp lc $b } @files;
        ???

        And as you build your perl skills, revisit the efficiencies of your code.

        Ok, I must be reading a different set of manual than yours. Anyway, that (keys %IN) returns an array a list containing the hash keys for %IN.

Re: noncase-sensitive sorting
by davido (Cardinal) on Dec 02, 2003 at 06:47 UTC
    You can put the uc or lc inside the code block of the sort function as Roger has demonstrated, and that's a perfectly legitimate solution. But there's a more efficient way; one that doesn't require a set of calls to lc for each sort comparison. The more efficient way is known as the Schwartzian Transform:

    my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, uc $_ ] } @unsorted;

    The reason this method is often preferable is that the uc function is only called once per item being sorted, rather than each time that item is compared. Yes, there is some cost associated with the two map transforms, but that usually will be less than the cost of 'expensive' routines in the comparison function.


    Dave

      Hi davido, I have to disagree with you on the efficiency here, because I believe a simple sort is faster here. ;-)

      I quickly wrote the following example to test the efficiency of the two different sort.

      use strict; use Data::Dumper; use Benchmark qw/timethese cmpthese/; my @letters = 'A'..'Z'; my @array; # generate array of 1000 random length random upper/lower # case strings to feed into the test subs. foreach (0..1000) { my $c = $letters [rand 26]; $c = int(rand 2) ? uc $c : lc $c; push @array, $c x (1 + rand 10); } cmpthese( timethese(1000, { 'Schwartzian' => '&sort_schwartzian', 'Simple' => '&sort_simple', }) ); sub sort_schwartzian { my @sorted = map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, uc $_ ] } @array; } sub sort_simple { my @sorted = sort { uc($a) cmp uc($b) } @array; }
      And the output is -
      Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 16 wallclock secs (16.39 CPU) @ 61.01/s (n=1000) Simple: 11 wallclock secs (10.95 CPU) @ 91.32/s (n=1000) Rate Schwartzian Simple Schwartzian 61.0/s -- -33% Simple 91.3/s 50% --
      It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has. And it demonstrates that the perl built-in uc and lc functions are quite efficient. %^p

        It shows that the simple sort is much faster than the Schwartzian sort, perhaps because of all the overheads the Schwartzian sort has.

        Well, it shows that's the case with very little (and very simple) data. Instead of using little strings of a single character repeated up to ten times, try it with, say, 4k chunks of random character data. (I.e. use somthing like this to generate your @array:

        foreach (0..1000) { my $string; $string .= ('a'..'z','A'..'Z')[rand 52] for 1 .. 4096; push @array, $string; }
        With that change, I got these numbers:
        Benchmark: timing 1000 iterations of Schwartzian, Simple... Schwartzian: 67 wallclock secs (66.85 usr + 0.04 sys = 66.89 CPU) @ 1 +4.95/s (n= 1000) Simple: 534 wallclock secs (533.43 usr + 0.06 sys = 533.49 CPU) @ + 1.87/s ( n=1000) Rate Simple Schwartzian Simple 1.87/s -- -87% Schwartzian 14.9/s 698% --
        Does that show that the Scwartzian transform should definitely be used because it is 7 times as fast as without it? No. It shows, together with your results, that whether or not you choose to use an ST is (or should be) dependent both on the complexity of your sorting code and on the data you will be working with. It isn't black-or-white by any means.

        It also shows that both lc and uc are O(N). Try the whole exercise again but sort by length and you'll probably find that there is no point to an ST at all. (Perl's length() is O(1).)

        -sauoq
        "My two cents aren't worth a dime.";
        
        That's educational. Thanks for running the numbers, Roger. I've tried to make it a habit to keep the complexity of the code inside a sort code-block to a minimum for efficiency reasons. Your results seem conclusive for cases with uc or lc. I am sure that the Schwartzian method would win if the complexity inside the sort function's code block got more intense.

        Another factor that has some bearing on the comparative efficiency of the Schwartzian Transform versus a flat sort with additional complexity inside the sort code block is the issue of best case and worst case scenarios for the original unsorted list. In other words, a worst-case order of the original list may benefit more from the transform, while a best-case order for the original list may suffer from the transform.

        The reason is that the two map's have a "fixed cost", whereas the code that appears within the code block of a sort function has a variable cost dependant on the original list's sort order.

        The moral: If it really matters, benchmark it. ;)


        Dave

Using a hash
by l3nz (Friar) on Dec 02, 2003 at 09:21 UTC
    I often use a solution more or less like this, that is based on a hash where the key is the value to be sorted (and in this case is case-insensitive) and the value is the actual string to be printed. It's quite compact and it kills dupes for free. As in this case I don't want it to kill dupes, I append an unique counter to the key so that each key is guaranteed to be unique.

    Here's the code:

    use strict; my %hDat; my $d; my $n; map( ($hDat{lc $_ . $n++} = $_), <DATA> ); foreach $d (sort keys %hDat) { print $hDat{$d}; } __DATA__ a B c D E Aa Xx A
    And here's the result:
    a
    A
    Aa
    B
    c
    D
    E
    Xx
    
    You know, there is more than one way. :-)