http://qs1969.pair.com?node_id=199981

I started to write this as a reply to Random array sorting, a question asked by kidd. I realized I was addressing a larger issue and that it didn't really belong there even though the example I use is related. I believe this is my first meditation.

A couple years ago, when I was at a different company than I am now, one of my co-workers asked pretty much the same question kidd did. He needed to randomize lines in a file, wanted to do it with perl, and was looking for the best (most efficient) way. My reply was the same as merlyn's to kidd, "it's a FAQ." The lead toolsmith, however, suggested another answer so simple, elegant, and efficient that I was simply blown away. His not only benchmarked much better but would have won if we were playing golf too.

To put things in perspective, our toolsmith was no slouch. If you work on a unix platform, the chances are that you have run code written by him. He has spent time maintaining some significant GNU projects and has contributed substantially to many others.

The solution that he proposed for randomizing an array is quite beautiful. You may want to use it. Here's a word of caution: DON'T. It is, unfortunately, incorrect.

@random = sort { .5 <=> rand(1) } @array; # DO NOT USE THIS!

Even when I first saw this, something about it felt wrong. It was too simple. I pointed out that the behavior of qsort(), the C function which acts as the basis for Perl's sort, is undefined when the comparison function doesn't produce an ordering. That, however, didn't seem to be much of an argument as A) it appeared to work and B) we wanted the ordering to be undefined. As long as it didn't add or remove elements, it should work, right? I gave up. He was right. It was beautiful. I would never be the developer he was. As far as I know, his solution went into production code and may well remain there today.

Sometime after I left that company, the question came up again. I looked a little closer at that most elegant of solutions. I never really tested it for correctness. Here's what I found:

$ perl -le '@n=(1..3);for(1..100000){@m=sort{.5<=>rand}@n;$h{"@m"}++}f +or(sort{$h{$b}<=>$h{$a}}keys%h){print"$_: $h{$_}"}' 1 2 3: 24990 2 1 3: 24892 2 3 1: 12572 1 3 2: 12541 3 1 2: 12514 3 2 1: 12491
Notice how, of the six possible orderings, two of them came up twice as many times as any of the four others. Not very random, is it? The problem lies in the way a quicksort works. Determining the details is left as an exercise.

This story has several morals. Among them are "no one is infallible," "it isn't a solution until it has been thoroughly tested," and "elegance is worthless without correctness." Another lesson I learned was that, since Perl hides so much power under the surface, sometimes a simple solution isn't simple at all.

I'm wondering if any other monks have stories of great solutions that turned out to be subtly wrong. I offer this thread up as a place to collect them so that they may serve as reminders to ourselves and warnings to others.

Edit: Added readmore tags.

-sauoq
"My two cents aren't worth a dime.";

Replies are listed 'Best First'.
Re: When the Best Solution Isn't
by thraxil (Prior) on Sep 23, 2002 at 03:44 UTC

    interesting. the incorrect solution that your toolsmith provided has inspired me to come up with a variation which i think is correct:

    my @shuffled = map {$_->[0]} sort { $a->[1] <=> $b->[1]} map {[$_, rand(1)]} @array;

    kind of a 'randomized schwartzian transform' approach.

    it benchmarks slightly faster than the standard Fisher-Yates shuffle but not quite as fast as the broken version. mine and Fisher-Yates are both going to be linear with the size of the array and you really can't do any better than that and still have a correct solution.

    i'm honestly baffled as to why the broken one runs so quickly. my best guess is that getting different values on each comparison in some cases confuses qsort to the point where it thinks it's done before it really is. alternatively though it seems like if you get the wrong series of random numbers, it could make qsort take even longer than normal.

    anders pearson

      Caveat: both our solutions take memory corresponding to the size of the input array.
      #!/usr/bin/perl -w use strict; use Benchmark qw(cmpthese); sub xform { map {$_->[0]} sort { $a->[1] <=> $b->[1]} map {[$_, rand(1)]} @_; } sub slice { my @random; push @random, rand 1 for 0 .. $#_; @_[ sort { $random[$a] <=> $random[$b] } 0 .. $#_ ]; } my @array = 1 .. 1000; cmpthese(1000, { slice => sub { slice @array }, xform => sub { xform @array }, }); __END__ Benchmark: timing 1000 iterations of slice, xform... slice: 17 wallclock secs (16.96 usr + 0.08 sys = 17.04 CPU) @ 58 +.69/s (n=1000) xform: 28 wallclock secs (27.76 usr + 0.12 sys = 27.88 CPU) @ 35 +.87/s (n=1000) Rate xform slice xform 35.9/s -- -39% slice 58.7/s 64% --

      Makeshifts last the longest.

        A slightly faster solution (on my machine) uses in-place assignment to @_:
        sub shuffle { my @r = map rand, 0 .. $#_; @_[sort { $r[$a] <=> $r[$b] } 0 .. $#_] = @_; }

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      actually, something occurred to me in the shower this morning. my approach isn't linear, it should be O(NlogN) because of the sort. but Fisher-Yates is linear, so it should pretty much beat anything else, at least for large inputs. but it doesn't seem to. now i'm really confused.

      anders pearson

Re: When the Best Solution Isn't
by BrowserUk (Patriarch) on Sep 23, 2002 at 13:14 UTC

    That weird shuffle sort (or desort might be a better term) does amazingly well on speed, and surprisingly, under at least some circumstances can actually also produce all possible combinations. Analyzing the distribution shows it real weakness. The story being told by the standard deviation figures below.

    thraxil and aristotles corrected forms do much here, but at the expense of the loss of speed. I didn't see japhy's version till my tests were complete so I haven't included it.

    However, a well crafted Fischer-Yates shuffles out does both of the "random sort" solutions for sheer performance by a good margin, being twice as fast with with arrays of 1000 elements and nearly 5 times as fast once you get to 100,000 elements.

    It also seems to do the best on the distribution test, but that maybe simply down to the random number generator rather then the algorithm. That's too deep statistical voodoo for me to determine.

    Update: Forgot to mention, the Fischer-Yates is an in-place sort so minimum memory usage too.

    #!/usr/bin/perl -w use strict; use Benchmark qw(cmpthese); use Data::Dumper; sub xform { map {$_->[0]} sort { $a->[1] <=> $b->[1]} map {[$_, rand(1)]} @_; } sub slice { my @random; push @random, rand 1 for 0 .. $#_; @_[ sort { $random[$a] <=> $random[$b] } 0 .. $#_ ]; } sub shufl { $a = $_ + rand @_ - $_ and @_[$_, $a] = @_[$a, $_] for (0..$#_); return @_; } sub qshuf { sort { .5 <=> rand(1) } @_; } my @array = 1 .. 1000; cmpthese(10, { slice => sub { slice @array }, xform => sub { xform @array }, shufl => sub { shufl @array }, qshuf => sub { qshuf @array }, }); my (%buckets, %d, @temp);; my @set = qw(A B C D); for (1 .. 100_000 ) { $buckets{"@{[slice @temp=@set]}"}{slice}++; $buckets{"@{[xform @temp=@set]}"}{xform}++; $buckets{"@{[shufl @temp=@set]}"}{shufl}++; $buckets{"@{[qshuf @temp=@set]}"}{qshuf}++; } print "\npermutation | slice | xform | shufl | qshuf \n"; print "--------------------------------------------------\n"; for my $key (sort keys %buckets) { printf "%8.8s: | %4d | %4d | %4d | %4d\n", $key, $buckets{$key}{slice}, $buckets{$key}{xform}, $buckets{$key}{shufl}, $buckets{$key}{qshuf}; $d{slice}{Ex} += $buckets{$key}{slice}; $d{slice}{Ex2} += $buckets +{$key}{slice}**2; $d{xform}{Ex} += $buckets{$key}{xform}; $d{xform}{Ex2} += $buckets +{$key}{xform}**2; $d{shufl}{Ex} += $buckets{$key}{shufl}; $d{shufl}{Ex2} += $buckets +{$key}{shufl}**2; $d{qshuf}{Ex} += $buckets{$key}{qshuf}; $d{qshuf}{Ex2} += $buckets +{$key}{qshuf}**2; } print "---------------------------------------------------\n"; printf "Std. Dev. | %0.3f | %0.3f | %0.3f | %0.3f\n", sqrt( ($d{slice}{Ex2} - ($d{slice}{Ex}**2/24))/23 ), sqrt( ($d{xform}{Ex2} - ($d{xform}{Ex}**2/24))/23 ), sqrt( ($d{shufl}{Ex2} - ($d{shufl}{Ex}**2/24))/23 ), sqrt( ($d{qshuf}{Ex2} - ($d{qshuf}{Ex}**2/24))/23 ); __END__ C:\test>199981 Benchmark: timing 10000 iterations of qshuf, shufl, slice, xform... qshuf: 3 wallclock secs ( 3.04 usr + 0.01 sys = 3.04 CPU) @ 32 +84.07/s (n=10000) shufl: 217 wallclock secs (209.08 usr + 0.01 sys = 209.09 CPU) @ + 47.83/s (n=10000) slice: 435 wallclock secs (429.57 usr + 0.01 sys = 429.58 CPU) @ + 23.28/s (n=10000) xform: 716 wallclock secs (693.68 usr + 0.00 sys = 693.68 CPU) @ + 14.42/s (n=10000) Rate xform slice shufl qshuf xform 14.4/s -- -38% -70% -100% slice 23.3/s 61% -- -51% -99% shufl 47.8/s 232% 105% -- -99% qshuf 3284/s 22681% 14008% 6767% -- permutation | slice | xform | shufl | qshuf -------------------------------------------------- A B C D: | 4322 | 4277 | 4127 | 12320 A B D C: | 4127 | 4115 | 4143 | 6134 A C B D: | 4284 | 4185 | 4156 | 6430 A C D B: | 4246 | 4083 | 4272 | 3094 A D B C: | 4205 | 4192 | 4062 | 3167 A D C B: | 4182 | 4128 | 4125 | 1597 B A C D: | 4143 | 4287 | 4246 | 12478 B A D C: | 4146 | 4156 | 4154 | 6273 B C A D: | 4027 | 4133 | 4133 | 6354 B C D A: | 4171 | 4153 | 4163 | 3092 B D A C: | 4191 | 4128 | 4201 | 3170 B D C A: | 4187 | 4233 | 4143 | 1546 C A B D: | 4088 | 4163 | 4170 | 6217 C A D B: | 4044 | 4197 | 4127 | 3190 C B A D: | 4214 | 4228 | 4114 | 6261 C B D A: | 4169 | 4021 | 4260 | 3080 C D A B: | 4069 | 4075 | 4185 | 1480 C D B A: | 4120 | 4102 | 4185 | 1533 D A B C: | 4177 | 4151 | 4199 | 3037 D A C B: | 4248 | 4207 | 4198 | 1608 D B A C: | 4175 | 4252 | 4203 | 3087 D B C A: | 4135 | 4173 | 4198 | 1641 D C A B: | 4203 | 4157 | 4098 | 1620 D C B A: | 4127 | 4204 | 4138 | 1591 --------------------------------------------------- Std. Dev. | 70.671 | 64.372 | 50.640 | 3127.055

    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
      sub qshuf { sort { .5 <=> rand(1) } @_; } cmpthese(10, { slice => sub { slice @array }, xform => sub { xform @array }, shufl => sub { shufl @array }, qshuf => sub { qshuf @array }, }); Rate xform slice shufl qshuf xform 14.4/s -- -38% -70% -100% slice 23.3/s 61% -- -51% -99% shufl 47.8/s 232% 105% -- -99% qshuf 3284/s 22681% 14008% 6767% --
      Mindboggling. "qshuf", an N log N algorithm, shows to be 67 faster than any of linear algorithms, and you don't raise a flag?

      Your benchmark is extremely flawed. You're running your sort in void context. Perl, being smart, knows that sort in void context isn't going to do anything useful, so it's not going to work up a sweat, and just won't do anything.

      Benchmark: timing 10000 iterations of qshuf, shufl, slice, xform
      Hmmm, with a first argument of '10' to 'cmpthese'? Either your output doesn't match your program, or your Benchmark.pm module is different than mine.
        Mindboggling. "qshuf", an N log N algorithm, ...

        Hmm. N log N with N=4 == 2.408. How does that affect your boogle? The main part of the post was the statistical analysis, but you are correct, the void context does make a difference to the timing and the output posted was obviously an accumulation from 2 different runs of the program which had been modified in the interim.

        Hmmm, with a first argument of '10' to 'cmpthese'? Either your output doesn't match your program

        It was a long time ago, so I am guessing more than remembering, but I think what probably happened was that I wrote the benchmark code first, ran it (with the iters counter set to 10000) and c&p'd the output to the source file.

        Seeing that qshuf() seemed very fast, (I probably wasn't even aware that sort silently optimised for void context rather than doing something sensible like issuing the "Useless in a void context" warning as other parts of perl do). I then went on to add the statistical analysis, and dropped the iterations counter to 10 to save time while getting the formatting of the output right.

        Here is a an updated version (run on 5.6.1 as the original was):

        And by way of recompense, here's an updated version incorporating List::Util::shuffle run on 5.8

        Thanks for pointing out the discrepancies--even if it did take 2 years for collected populace to notice :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?
Re: When the Best Solution Isn't
by blakem (Monsignor) on Sep 23, 2002 at 03:01 UTC
    @random = sort { .5 <=> rand(1) } @array; # DO NOT USE THIS!
    Aside from not being a fair shuffle (8 paths mapped to 6 end states) I believe it can actually cause older perls to dump core.

    Sorting assumes a transitive ranking function... If A > B and B > C then A > C. Returning random numbers for your comparison results wont comply with this requirement, and can wreak havoc depending on the internal sorting implementation.

    -Blake