chanv has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl guru, I have a question on sorting. Browsing through the internet as well as here, I came across some code snippets on sorting. Finally, I did this (example from Tom C.) from his "FMTEYEWTK":
-- #!/usr/bin/perl my $str = `type unsort.txt`; $str = join("\n", map($_->[0], sort( {$a->[0] cmp $b->[0] } map( [$_, (split('|', $_))[-1] ], split(/\n/, $str) ) ) ) ); print $str; --
unsort.txt contains the following:
-- xyz4|1026|CR xyz3|3461|CR dert5|3251|PR rtyaq|251|PR dbca|583|PR xxxt|360|CR --
Output is the following which is correct:
dbca|583|PR dert5|3251|PR rtyaq|251|PR xxxt|360|CR xyz3|3461|CR xyz4|1026|CR
But if I change the sort fn to
sort( {$a->[1] cmp $b->[1] }
which I am sorting the list using the first field as the key, then I got the following output:
-- xyz4|1026|CR xyz3|3461|CR dert5|3251|PR rtyaq|251|PR dbca|583|PR xxxt|360|CR --
1. The above is not sorted at all. What did I do wrong?
2. What if I want to sort using field 1 and field 2 as keys. What do I have to do? I have been trying to find answers from the WWW but to no avail.
I appreaciate your help. TIA. Vincent

Replies are listed 'Best First'.
Re: sorting question
by BrowserUk (Patriarch) on Nov 21, 2002 at 06:32 UTC

    The sort function you have is not sorting by field at all. It is using the whole string. If you look at this line

    map( [$_, (split('|', $_))[-1] ],

    You'll see that it is constructing anonymous arrays where the first element in the array is the whole line ($_) and the second is the looks like it should be last pipe delimited field from that line (split('|', $_))[-1]. It isn't, but I'll get back to that.

    However, if the sort block, it is using the first of these two elements $_->[0] which is the whole string.

    For reasons I'll admit to not understanding split '|', 'abc|def|ghi' splits not on the pipe char, but every character? It therefore returns the list ('a', 'b', 'c', '|', 'e', 'f', '|', 'g', 'h', 'i')? This may be obvious to everyone else, but it fooled me.

    The upshot of that is that when you changed the sort block to be sort( {$a->[1] cmp $b->[1] } what you where actually sorting on was the last character in the string which is the same for every line, hence the apparantly random sort order.

    So, to correct that problem, I changed to split /\|/, $_. That means that the anonymous array passed to the sort will contain the string and the last field ([-1]), which isn't helpful if you want to sort by the first two. To address this we need to change that to a slice of the first two fields ([0,1]) and then modify the sort block to use these two fields. However as the second field is numeric, it makes the sort block a little more complicated. We need to compare the two first-fields first and if they are equal, then compare the two second-fields numerically. {$a->[1] cmp $b->[1] || $a->[2] <=> $b->[2]}.

    Putting that all together and restructuring the code in the way I find most readable, I get:

    $str = join("\n", map { $_->[0] } sort{ $a->[1] cmp $b->[1] || $a->[2] <=> $b->[2] } map { [$_, ( split /\|/, $_ )[0,1] ] } split/\n/, $str );

    As none of the first fields in the sample data are the same, the second field is never compared, but if more or different data is used, the above would handle it.

    Having worked through that, I get the sneaking suspicion that this is a very carefully constructed question of the homework variety?


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: sorting question
by kabel (Chaplain) on Nov 21, 2002 at 06:14 UTC
    for short: the split accepts a regexp as first parameter, wherein the | is a metasymbol, and you did NOT quote it properly ... ;)
    use strict; use warnings; use constant FIRST_FIELD => 1; use constant SECOND_FIELD => 2; use constant THIRD_FIELD => 3; my $str = ""; { local $/; $str = <DATA>; } $str = join "\n", map { $_->[0] } sort {$a->[THIRD_FIELD] cmp $b->[THIRD_FIELD] } map { [$_, split('\|', $_)] } split(/\n/, $str); print $str; __DATA__ xyz4|1026|CR xyz3|3461|CR dert5|3251|PR rtyaq|251|PR dbca|583|PR xxxt|360|CR
    note further that cmp does NOT work for numerics, or look for yourself what is going wrong :) you need the <=> operator to do numeric sorting in this way.
    and you saw the characters "ST" or "schwartzian transform" along with the code? ;)
    HTH

    UPDATE: added ST line
Re: sorting question
by jarich (Curate) on Nov 21, 2002 at 06:43 UTC
    Perhaps an explanation of how kabel's code works might help you understand where you went wrong. It's a little different from your version but they have similar principles behind them.
    $str = join "\n", map { $_->[0] } sort {$a->[THIRD_FIELD] cmp $b->[THIRD_FIELD] } map { [$_, split('\|', $_)] } split(/\n/, $str);

    So, how does the schwartzian transform work? First you'll have to realise that the code works from the bottom up, essentially, so this description will have to do so too.

    split(/\n/, $str);
    This is easy. $str is a huge string of lines joined with newlines. This splits it up so we have a list of lines.

    map { [$_, split('\|', $_)] } split(/\n/, $str);
    Well... split('\|', $_) breaks up a given line on the pipe (|) character. So "abc|123|cde" becomes a list of ("abc", 123, "cde");

    Now [$_, split('\|', $_)] creates a list with the original string ($_) and then our split up version. So we'd get: ("abc|123|cde", "abc", 123, "cde") from our line above. Map does this across all elements of the following list, so we end up with an list of lists; for example:

    ( ["abc|123|cde", "abc", 123, "cde"], ["def|222|eee", "def", 222, "eee +"])
    Easy so far.

    sort {$a->[THIRD_FIELD] cmp $b->[THIRD_FIELD] }
    This carries out our sort over the list we've created with the previous map. Note that we can pick which field to sort by, by changing the value. If you wanted to sort by several keys you might do something like:
    sort {$a->[THIRD_FIELD cmp $b->[THIRD_FIELD] || $a->[FIRST_FIELD cmp $b->[FIRST_FIELD]}
    cmp returns 0 if the compare is the same so then the other side of the or will be evalated.

    join "\n", map { $_->[0] }
    Remember that we have a list of lists at this point and the first element of each sublist is the original string. This map merely discards all the other elements of each sublist, keeping the original string. We then join them each with a newline.

    I hope that helps you understand how it's supposed to work.

    jarich

Re: sorting question
by pg (Canon) on Nov 21, 2002 at 06:40 UTC
    Made couple of changes. (One of those changes is to use '\|' instead of '|'):
    my $str = `type unsort.txt`; $str = join("\n", map($_->[0], sort( {$a->[1] <=> $b->[1] }#for alpha order change <=> to + cmp map( [$_, (split('\|', $_))[1] ],#change this number to +control which field to use, 0 for 1st field split(/\n/, $str) ) ) ) ); print $str;
    Explaination:

    1. the split breaks the content of the file, into lines. Now we have a list contains all the lines;
    2. the inner map creates a list of array ref, which refs an array of two elements, the first one holds the entire line, when the second one contains the sorting key. You can choose which field to be used as the key;
    3. the sort sorts by the 2nd element, which is the key we choosen;
    4. the outter map then creates a list contains only the whole lines;
    5. finally, the join put the line breaks back, and form a string.
Re: sorting question
by chanv (Initiate) on Nov 21, 2002 at 10:30 UTC
    Thank you all for helping me out. Immediately after I submitted my question, I browse through this site and came across some code snippet that is currently working for me. Here is what I have done:
    #!/usr/bin/perl while (<>) { $str .= $_; } @lines = split ("\n", $str); @sorted = sort fooSort @lines; foreach $line (@sorted) { print ("$line\n"); } sub fooSort { my @a = split '\|', $a; my @b = split '\|', $b; $a[0] cmp $b[0] || $b[1] <=> $a[1] }
    The above achieves what I wanted.
    I will also try out some of the suggestions given by others. Thank you again for your help.

    regards,
    vincent