I thought I would add to this snippet since I had to stare at the code
for far to long before I figured out how I might use it.
Okay so we know it is for sorting, but why is it useful? It is
a heck of a lot faster for sorting large data sets where the data has to be
manipulated before sorting. The example I came up with is this:
We have a huge list of names, and we want to sort them alphabetically
by last name, first name, then middle name. That doesnt sound too
bad. But the problem is that our database does not store last names, first names
or middle names seperately. Instead there is just a single name field like
"Arthur C. Clarke". So when we get all the data from the database first we
have to put the string in alphapbetical sort order (ie make "Arthur C. Clarke"
into "Clarke Arthur C.") then sort it.
Someone without Randal Schwartz's insight (namely me) would have done something like
this:
my @sorted = sort mysort @unsorted;
sub mysort
{
$a =~ m/(.*?)\s*(\S+)$/;
$aa = "$2 $1";
$b =~ m/(.*?)\s*(\S+)$/;
$bb = "$2 $1";
return $aa cmp $bb;
}
Now this would be bad for large data sets because each element in @unsorted
will get sent to mysort several time, so we are then performing the match
several times where we dont really need to.
So in order to only perform the match once we go through the array
and replaces each element with an array reference containing the original
element (dont loose the original!) and the new sortable string.
So the first step:
@unsorted = ("Larry Wall", "Arthur C. Clarke");
@a = map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] } @unsorted;
#
# now @a is (["Larry Wall", "Wall Larry"],
# ["Arthur C. Clarke", "Clarke Arthur C."])
#
Now we do a simple sort on the second element in @a:
@b = sort {$a->[1] cmp $b->[1]} @a;
And finally we turn the 2D array back into a sorted 1D array restoring
the original values:
@sorted = map {$_->[0]} @b;
If you cascade the function calls, then you remove the intermediate
steps of my @a, and @b and you get the Schwartzian Transform:
my @sorted =
map{ $_->[0] }
sort {$a->[1] cmp $b->[1]}
map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1" ] }
@unsorted;
Now that is pretty cool stuff, but how much of a time savings is this extra
work? Well, I did a small benchmarking test to mimic what a time savings
would under a large data set:
#!/usr/bin/perl
use Benchmark;
@unsorted = ("Larry Wall",
"Jane Sally Doe",
"John Doe",
"Morphius",
"Jane Alice Doe",
"Arthur C. Clarke");
timethese(100000, {
"schwartzian" => sub { my @sorted =
map{ $_->[0] }
sort {$a->[1] cmp $b->[1]}
map { m/(.*?)\s*(\S+)$/; [$_, "$2 $1"
+] }
@unsorted },
"sort routine" => sub { my @sorted = sort mysort @unsorted }
});
sub mysort
{
$a =~ m/(.*?)\s*(\S+)$/;
$aa = "$2 $1";
$b =~ m/(.*?)\s*(\S+)$/;
$bb = "$2 $1";
return $aa cmp $bb;
}
And the results:
Benchmark: timing 100000 iterations of schwartzian, sort routine...
schwartzian: 57 wallclock secs (53.66 usr + 0.30 sys = 53.96 CPU)
sort routine: 124 wallclock secs (122.48 usr + 0.41 sys = 122.89 CPU)
I hope this helps.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.