comment on

paragkalra, merlyn has demo'ed some advanced stuff. The technique he used is called a Schwartzian Transform and of course is named after him. This is a performance enhancement (not a functional enhancement) over less complex techniques.

You asked: If somebody can walk me through the above lines.....

You seem interested and I'm gonna give you some hints, but first I think you would be well advised to learn more about basic sorting. The following code is a "simple" sort and will be slower than merlyn's code as the number of lines get larger but it is functionally the same. I'll walk you through this and then the more advanced code will make more sense.

#!/usr/bin/perl -w
use strict;

my $column_number =3;
my $prev = "";

my @input_lines = <DATA>;  #slurps DATA in all at once

@input_lines =    
        sort {my ($a_col) = (split(/\s+/,$a))[$column_number];
              my ($b_col) = (split(/\s+/,$b))[$column_number];
                  $a_col cmp $b_col
             }@input_lines;
                     
foreach (@input_lines)
{
   print unless $_ eq $prev;
   $prev = $_;
}   

__DATA__
FT9mWp d4fgMB gvZRJU XRRu0N X
4ewejk pFnjd4 ie0hZF pPipQJ A
4ewejk 4sqprx ie0hZF cqtexi 3
Fo1OKn qhZPvb qWZPrt ruBObs 2
[download]

First we want to get all the data lines. This can be done like above or with a while{} loop and pushing input lines onto the @inputlines array.

Now we come to the sort. An important thing is to understand that what goes into a sort{} is what comes out! We can sort all kinds of complex structures. But what goes in, is what comes out is true. Here I assigned the output back to the input, which is completely legal and common.

Sort works on some "magic variables", $a and $b. The sort function compares pairs of "things", whatever they are that are coming in. And $a and $b are assigned to those "thing pairs". Here we have just lines of text input. But this could be something like rows of an Array of rows (an ArrayofArray) or even other "things".

@inputlines is what comes in (and is what is going to come out of the sort), but we have a problem because we only want to sort the input lines based on a subset of the line, the thing in the specified column. To get the right column, we use split() which generates a list of tokens that were separated by whitespace. Now only one of these columns is relevant. So this (split(/\s+/,$a))[$column_number] stuff is a "slice" and says that I only want the one of the things from the split() operation - the stuff in a particular $column_number. That is done and assigned twice once for the $a input and once for $b input. Then we do a "cmp" which is an alpha comparison yielding a "less than, equal, greater than" value. The sort{} function uses this decision to order the inputlines. Remember what I said before, what goes into a sort comes out of a sort, so this decision affects the inputlines, not just this particular column!!

Then results are printed. Lines that are the same as previous lines are suppressed.

Now as you see if you're following me so far, every time I want to compare 2 input lines, I have to go through this split() and then slice stuff. What merlyn's code does is precompute these comparison values once and then use these precomputed results in the sort. The sort doesn't do any fewer compares, but it spends less time figuring out exactly what to compare! The "cost" is building an intermediate data structure with those values (which takes time and memory).

The Schwartzian Transform creates a new ArrayofArray, with each "row" containing the original thing and then the things that are to be used in the sort comparison function. That array is the "thing" that gets inputted into the sort. After the sort is done, we don't need these precomputed values any more, so the original input is re-constructed, ie the precomputed things are thrown away.

Note that these map{},sort{}grep,map{} things process lists. Because the "line" could get way long and it is easier to read, they are usually "stacked", read this stack from the bottom up, actually right to left.

Sorry post got long, but I hope this helps you understand at least the general idea of what is going on. I would recommend practicing more along the lines of my example before launching off into more complex techniques.

In reply to Re^3: Emulating 'sort' command through a Perl by Marshall
in thread Emulating 'sort' command through a Perl by paragkalra

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.