paragkalra, merlyn has demo'ed some advanced stuff. The
technique he used is called a Schwartzian Transform and of
course is named after him. This is a performance enhancement (not
a functional enhancement) over less complex techniques.
You asked: If somebody can walk me through the above lines.....
You seem interested and I'm gonna give you some hints, but first I think
you would be well advised to learn more about basic sorting. The following
code is a "simple" sort and will be slower than merlyn's code as the
number of lines get larger but it is functionally the same.
I'll walk you through this and then the more
advanced code will make more sense.
#!/usr/bin/perl -w
use strict;
my $column_number =3;
my $prev = "";
my @input_lines = <DATA>; #slurps DATA in all at once
@input_lines =
sort {my ($a_col) = (split(/\s+/,$a))[$column_number];
my ($b_col) = (split(/\s+/,$b))[$column_number];
$a_col cmp $b_col
}@input_lines;
foreach (@input_lines)
{
print unless $_ eq $prev;
$prev = $_;
}
__DATA__
FT9mWp d4fgMB gvZRJU XRRu0N X
4ewejk pFnjd4 ie0hZF pPipQJ A
4ewejk 4sqprx ie0hZF cqtexi 3
Fo1OKn qhZPvb qWZPrt ruBObs 2
First we want to get all the data lines. This can be done like above or
with a while{} loop and pushing input lines onto the @inputlines array.
Now we come to the sort. An important thing is to understand that what goes
into a sort{} is what comes out! We can sort all kinds of complex structures.
But what goes in, is what comes out is true. Here I assigned the output back
to the input, which is completely legal and common.
Sort works on some "magic variables", $a and $b. The sort function compares pairs
of "things", whatever they are that are coming in. And $a and $b are assigned to
those "thing pairs". Here we have just lines of text input. But this could
be something like rows of an Array of rows (an ArrayofArray) or
even other "things".
@inputlines is what comes in (and is what is going to come out of the sort),
but we have a
problem because we only want to sort the input lines based on a subset of the
line, the thing in the specified column. To get the right column, we use split() which generates a
list of tokens that were separated by whitespace. Now only one of these columns
is relevant. So this (split(/\s+/,$a))[$column_number] stuff is a "slice"
and says that I only want the one of the things from the split() operation - the stuff in a particular $column_number.
That is done and assigned twice once for the $a input and once for $b input. Then we
do a "cmp" which is an alpha comparison yielding a "less than, equal, greater
than" value.
The sort{} function uses this decision to order the inputlines.
Remember what I said before, what goes into a sort comes out of a sort, so
this decision affects the inputlines, not just this particular column!!
Then results are printed. Lines that are the same as previous lines are suppressed.
Now as you see if you're following me so far, every time I want to compare 2
input lines, I have to go through this split() and then slice stuff. What
merlyn's code does is precompute these comparison values once and then use
these precomputed results in the sort. The sort doesn't do any fewer compares,
but it spends less time figuring out exactly what to compare! The "cost" is building
an intermediate data structure with those values (which takes time and memory).
The Schwartzian Transform creates a new ArrayofArray, with
each "row" containing the original thing and then the things that are to be used
in the sort comparison function. That array is the "thing" that gets inputted into
the sort. After the sort is done, we don't need
these precomputed values any more, so the original input is re-constructed, ie the
precomputed things are thrown away.
Note that these map{},sort{}grep,map{} things process lists. Because the "line"
could get way long and it is easier to read, they are usually "stacked", read
this stack from the bottom up, actually right to left.
Sorry post got long, but I hope this helps you understand at least the general
idea of what is going on. I would recommend practicing more along the lines of my example before launching off into more complex techniques. |