Re^2: Emulating 'sort' command through a Perl

Hi Randal,

Thanks a bunch..

Today I have attained the "Nirvana" because of you.

My primary interests being Electronics, Physics, Movies, Novels & Linux, I generally don't read books on software engineering. Most of the times I do Googly.

However your's 'Learning Perl' is the only computer related book which I have read in my life. I just finished it few weeks back.

And I am not exaggerating it but I finished it like a Novel. It was so easy and simple to understand.

Its because of your book I have started loving & enjoying Perl. Great Job Randal!!! I do owe many things to you...

I always wanted to meet you and with you replying to my post, my "Karma" has been blessed with gracious bliss....Lucky me...

Coming to your piece of code....It worked like a breeze....All I can say is that - I am speechless

But being frank & forthright, I couldn't understand it much :( . This is the most strangest & strongest Perl code I have ever seen with hardly any use of semicolons.

My head is spinning between sort & map operators. Also I was not knowing 'cmp' operator as I had skip the section 13.4 of your book... :(

I have never used the split operator in the fashion which you have used. I am not sure what '$_->[0], $->1 are doing'. I think they have something to do with references.

If somebody can walk me through the above lines - explaining through one liner comments, I would be grateful...

Comment on Re^2: Emulating 'sort' command through a Perl

Replies are listed 'Best First'.
Re^3: Emulating 'sort' command through a Perl by Marshall (Canon) on Oct 30, 2009 at 19:33 UTC
paragkalra, merlyn has demo'ed some advanced stuff. The technique he used is called a Schwartzian Transform and of course is named after him. This is a performance enhancement (not a functional enhancement) over less complex techniques. You asked: If somebody can walk me through the above lines..... You seem interested and I'm gonna give you some hints, but first I think you would be well advised to learn more about basic sorting. The following code is a "simple" sort and will be slower than merlyn's code as the number of lines get larger but it is functionally the same. I'll walk you through this and then the more advanced code will make more sense. `#!/usr/bin/perl -w use strict; my $column_number =3; my $prev = ""; my @input_lines = <DATA>; #slurps DATA in all at once @input_lines = sort {my ($a_col) = (split(/\s+/,$a))[$column_number]; my ($b_col) = (split(/\s+/,$b))[$column_number]; $a_col cmp $b_col }@input_lines; foreach (@input_lines) { print unless $_ eq $prev; $prev = $_; } __DATA__ FT9mWp d4fgMB gvZRJU XRRu0N X 4ewejk pFnjd4 ie0hZF pPipQJ A 4ewejk 4sqprx ie0hZF cqtexi 3 Fo1OKn qhZPvb qWZPrt ruBObs 2` [download] First we want to get all the data lines. This can be done like above or with a while{} loop and pushing input lines onto the @inputlines array. Now we come to the sort. An important thing is to understand that what goes into a sort{} is what comes out! We can sort all kinds of complex structures. But what goes in, is what comes out is true. Here I assigned the output back to the input, which is completely legal and common. Sort works on some "magic variables", $a and $b. The sort function compares pairs of "things", whatever they are that are coming in. And $a and $b are assigned to those "thing pairs". Here we have just lines of text input. But this could be something like rows of an Array of rows (an ArrayofArray) or even other "things". @inputlines is what comes in (and is what is going to come out of the sort), but we have a problem because we only want to sort the input lines based on a subset of the line, the thing in the specified column. To get the right column, we use split() which generates a list of tokens that were separated by whitespace. Now only one of these columns is relevant. So this `(split(/\s+/,$a))[$column_number]` stuff is a "slice" and says that I only want the one of the things from the split() operation - the stuff in a particular $column_number. That is done and assigned twice once for the $a input and once for $b input. Then we do a "cmp" which is an alpha comparison yielding a "less than, equal, greater than" value. The sort{} function uses this decision to order the inputlines. Remember what I said before, what goes into a sort comes out of a sort, so this decision affects the inputlines, not just this particular column!! Then results are printed. Lines that are the same as previous lines are suppressed. Now as you see if you're following me so far, every time I want to compare 2 input lines, I have to go through this split() and then slice stuff. What merlyn's code does is precompute these comparison values once and then use these precomputed results in the sort. The sort doesn't do any fewer compares, but it spends less time figuring out exactly what to compare! The "cost" is building an intermediate data structure with those values (which takes time and memory). The Schwartzian Transform creates a new ArrayofArray, with each "row" containing the original thing and then the things that are to be used in the sort comparison function. That array is the "thing" that gets inputted into the sort. After the sort is done, we don't need these precomputed values any more, so the original input is re-constructed, ie the precomputed things are thrown away. Note that these map{},sort{}grep,map{} things process lists. Because the "line" could get way long and it is easier to read, they are usually "stacked", read this stack from the bottom up, actually right to left. Sorry post got long, but I hope this helps you understand at least the general idea of what is going on. I would recommend practicing more along the lines of my example before launching off into more complex techniques.	[reply] [d/l] [select]
Re^4: Emulating 'sort' command through a Perl by paragkalra (Scribe) on Oct 30, 2009 at 20:28 UTC
Thanks Marshall for sharing your time & your wonderful notes...	[reply]
Re^5: Emulating 'sort' command through a Perl by Marshall (Canon) on Oct 31, 2009 at 10:02 UTC
You are most welcome! On another point...white space is the easiest thing that you can do to enhance code readability and understandability. Consider this re-formatted code with just white space added... Now it is clear that the "map{}sort{}map{}<>" "stack" feeds the "for" loop. The "for" is iterating on the list provided by the map { $_->[0] } which is the end result of the 4 lines in this map{}sort{}map{}<> "stack" (which is actually just one line). When that whole thing is nicely indented, it is easier to see. Human eyeballs work best vertically, not diagonally. `for ( map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [$_, (split)[$column_number] ] } <> ) { print unless $_ eq $prev; $prev = $_; }` [download] This "for" statement is the same as: `for(map{$_->[0]}sort{$a->[1]cmp$b->[1]}map{[$_,(split)[$column_number]]}<>)` what do you think is easier to read?	[reply] [d/l] [select]
Re^3: Emulating 'sort' command through a Perl by paragkalra (Scribe) on Oct 29, 2009 at 21:48 UTC
I think I have found 1 bug in the code. Suppose I am sorting a text file containing following rows by column 3 i.e numeric 2: FT9mWp<SPACE>d4fgMB<SPACE>gvZRJU<SPACE>XRRu0N 4ewejk<SPACE>pFnjd4<SPACE>ie0hZF<SPACE>pPipQJ 4ewejk<SPACE>4sqprx<SPACE>ie0hZF<SPACE>cqtexi Fo1OKn<SPACE>qhZPvb<SPACE>qWZPrt<SPACE>ruBObs The code should give the output: FT9mWp<SPACE>d4fgMB<SPACE>gvZRJU<SPACE>XRRu0N 4ewejk<SPACE>4sqprx<SPACE>ie0hZF<SPACE>cqtexi 4ewejk<SPACE>pFnjd4<SPACE>ie0hZF<SPACE>pPipQJ Fo1OKn<SPACE>qhZPvb<SPACE>qWZPrt<SPACE>ruBObs But it is giving: FT9mWp<SPACE>d4fgMB<SPACE>gvZRJU<SPACE>XRRu0N 4ewejk<SPACE>pFnjd4<SPACE>ie0hZF<SPACE>pPipQJ 4ewejk<SPACE>4sqprx<SPACE>ie0hZF<SPACE>cqtexi Fo1OKn<SPACE>qhZPvb<SPACE>qWZPrt<SPACE>ruBObs Difference can be found out through 2nd & 3rd cell values of 2nd column i.e numeric 1 of both the outputs.	[reply]
Re^4: Emulating 'sort' command through a Perl by Marshall (Canon) on Oct 30, 2009 at 18:08 UTC
I think you are confused with column numbering. In Perl, the first column is number 0 and the second is number 1, etc. The the output you show is what the above program does (sorts on ",~~4th~~3rd" column). If you want a different column number to sort by adjust the number accordingly. Remember that sort order is according to ASCII values, but the Unix utilities, Excel will also sort alphanumeric fields this same way. In short, check you "column math" the program works according to my testing.	[reply]
Re^5: Emulating 'sort' command through a Perl by Anonymous Monk on Oct 30, 2009 at 18:44 UTC
Yes I know the column here is 3rd. Let me explain the bug taking an example. Suppose if any 2 row values of 3rd column are same (like in this case ie0hZF) In that case it should sort by first column values of those 2 rows. Now what if those 2 values are also same as. (like in this case - 4ewejk) Then it will try to compare by respective 2nd column values and what if one of the value start with a digit (like in this case 4sqprx) and other with alphabet (like in this case pFnjd4) then number should come first bcoz its ascii value is less. Which is not happening here...	[reply]
Re^6: Emulating 'sort' command through a Perl by Marshall (Canon) on Oct 30, 2009 at 20:11 UTC