in reply to Re^23: Why is the execution order of subexpressions undefined?
in thread Why is the execution order of subexpressions undefined?

Here is a perl program that does a fair shuffle on a 1 million line file in 13 seconds.

#! perl -sw
use strict;

$| = 1;

# In-place Fisher-Yates shuffle; accepts an array ref, or shuffles @_ itself.
sub shuffle {
    my $ref = @_ == 1 ? $_[ 0 ] : \@_;
    my $n   = @$ref;
    for ( 0 .. $#$ref ) {
        # Pick a random element from the not-yet-shuffled tail and swap it in.
        my $p = $_ + rand( $n-- );
        my $t = $ref->[ $p ];
        $ref->[ $p ] = $ref->[ $_ ];
        $ref->[ $_ ] = $t;
    }
    return unless defined wantarray;
    return wantarray ? @{ $ref } : $ref;
}

warn 'Start : ' . localtime() . $/;
my @data = <>;
warn 'Read : ' . localtime() . $/;

shuffle \@data;
warn 'Shuffled: ' . localtime() . $/;

print for @data;
warn 'Written : ' . localtime() . $/;

__END__
P:\test>shuffleFile.pl data\1millionlines.dat >nul
Start : Fri Apr 15 19:41:44 2005
Read : Fri Apr 15 19:41:53 2005
Shuffled: Fri Apr 15 19:41:55 2005
Written : Fri Apr 15 19:41:57 2005

Can Haskell do this? I believe the answer to be no, because the 1 million line file in my example was chosen precisely because, when loaded, it pushes memory consumption (on my machine) very close to the physical limits.

The hand-coded shuffle routine is used in preference to List::Util::shuffle because it operates in-place rather than duplicating the list. This could equally well be an in-place qsort.

The equivalent Haskell program would be less efficient because:

  1. It could not do the shuffle/sort in-place, as that involves side-effects.

    It would therefore need to duplicate the data at least once, which, given the parameters of the program above, would push my machine into swapping.

  2. A Haskell equivalent would probably require at least twice as much memory, as the only fair-shuffle mechanism I have seen that doesn't operate in-place uses a tree structure, which must (I think) take up more storage than an array or list.

    Equally, the popular and celebrated Haskell qsort implementation (reproduced below) uses huge amounts of RAM as it divides and subdivides the input list into a zillion smaller lists during recursion. It's an extremely neat implementation, and really quite fast on smallish lists, but efficiency isn't just about operations that run well within the limits; it's also about what happens when those limits are breached, and where those limits are.
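    For reference, the implementation usually meant here is the two-line list-comprehension quicksort, shown below from memory (variable names differ between presentations), with a small main added so it runs standalone:

    qsort :: Ord a => [a] -> [a]
    qsort []     = []
    qsort (x:xs) = qsort [ y | y <- xs, y < x ] ++ [x] ++ qsort [ y | y <- xs, y >= x ]

    main :: IO ()
    main = print (qsort [3, 1, 4, 1, 5, 9, 2, 6 :: Int])

    Each recursive call builds fresh lists for the two partitions, which is where the memory overhead described above comes from.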

RAM may be cheap, but disk space is even cheaper, and the volumes of data being routinely processed are growing much faster than typical RAM.

Haskell, and other FP languages that categorically deny the possibility of programming with side effects--anything that operates in-place--will always be disadvantaged with respect to dealing with large volumes of data that must be in memory for algorithms to work.

Perl is memory hungry, but also provides the tools that allow that hunger to be bypassed by using alternative techniques. Haskell doesn't provide this fallback position.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

GHC shuffle more efficient than Perl5.
by audreyt (Hermit) on Apr 22, 2005 at 14:23 UTC
    So, I ported your Perl 5 program to Haskell, and benchmarked both against the 1millionlines.dat generated with this:
    for (1..1_000_000) { print int(rand(10)), $/ }
    Result on my FreeBSD 5.3 laptop is:
    Perl5: 4.619u 1.010s 0:05.89 95.4% 10+79736k 0+0io 0pf+0w
    GHC: 3.007u 0.038s 0:03.10 97.7% 323+359k 0+0io 0pf+0w
    Note the constant memory use by GHC. The Haskell source code is attached below.
Haskell and inplace sorting
by audreyt (Hermit) on Apr 22, 2005 at 09:12 UTC
    Greetings. I'd just like to point out that Haskell can very well do side effects inside the IO monad, including in-place sorting, down to pointer arithmetic with C structures. Ever since the Monadic Revolution, side effects in Haskell have become composable and distinguished from pure values, but every bit as usable as the equivalent C code.
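    To make that concrete, here is a minimal sketch (mine, not the attached benchmark code) of an in-place Fisher-Yates shuffle over a mutable IOArray; it assumes only Data.Array.IO and System.Random, and the name shuffleArray is purely illustrative:

    import Control.Monad (forM_)
    import Data.Array.IO (IOArray, getBounds, getElems, newListArray,
                          readArray, writeArray)
    import System.Random (randomRIO)

    -- Shuffle the array in place inside the IO monad: no copy of the
    -- data is made, only elements are swapped.
    shuffleArray :: IOArray Int a -> IO ()
    shuffleArray arr = do
        (lo, hi) <- getBounds arr
        forM_ [lo .. hi - 1] $ \i -> do
            j  <- randomRIO (i, hi)   -- random partner from the unshuffled tail
            vi <- readArray arr i
            vj <- readArray arr j
            writeArray arr i vj
            writeArray arr j vi

    main :: IO ()
    main = do
        arr <- newListArray (0, 9) [0 .. 9 :: Int] :: IO (IOArray Int Int)
        shuffleArray arr
        getElems arr >>= print

    The swaps mutate the array directly inside IO, so no second copy of the data is built; the same pattern works in ST when a pure interface is wanted.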
Re^25: Why is the execution order of subexpressions undefined?
by Anonymous Monk on Apr 18, 2005 at 01:23 UTC
    But a shuffle is nothing but a mapping of array indices to indices. So you can eliminate all the memory churn associated with your update-in-place scheme. It'll be faster in Perl or your functional language of choice. The shuffle routine will, of course, have to be parameterized for the length of the array you are using. Its definition is left as an exercise for the reader.
    $size = @big_array;
    for ( 0 .. $size - 1 ) { print $big_array[ shuffle( $_, $size ) ]; }

      You mean something like this?

      #! perl -slw
      use strict;

      $|++;

      warn 'Start : ' . localtime() . $/;

      # Slurp the whole file into a single scalar.
      my $data = do { local $/; <> };
      warn 'Read : ' . localtime() . $/;

      # Build a packed string of 32-bit offsets, one per line start.
      my $lines = $data =~ tr[\n][\n];
      my $idx   = chr( 0 ) x ( 4 * $lines );

      my( $i, $p ) = ( 1, 0 );
      substr( $idx, 4 * $i++, 4 ) = pack 'N', $p
          while $p = 1 + index $data, $/, $p;

      # Shuffle the offsets (Fisher-Yates) rather than the lines themselves.
      my $n = $lines;
      for my $p ( 0 .. $n - 1 ) {
          my $next = $p + int rand( $n-- );
          my $t = substr( $idx, $p * 4, 4 );
          substr( $idx, $p    * 4, 4 ) = substr( $idx, $next * 4, 4 );
          substr( $idx, $next * 4, 4 ) = $t;
      }
      warn 'Shuffled: ' . localtime() . $/;

      # Print each line via its shuffled offset; printf is unaffected by $\,
      # so -l's automatic line terminator is not appended a second time.
      for ( 0 .. $lines - 1 ) {
          my $offset = unpack 'N', substr( $idx, 4 * $_, 4 );
          printf substr( $data, $offset, 1 + index( $data, $/, $offset ) - $offset );
      }
      warn 'Written : ' . localtime() . $/;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?