in reply to Re^5: Strategy for randomizing large files via sysseek
in thread Strategy for randomizing large files via sysseek

The post to which you've responded was meant to warn against using Tie::File for this kind of random read/write access to very large files. For what it is good at, Tie::File is a brilliant module--I use it all the time--but this isn't what it is good at.

In terms of your timings, they are pretty good. Here are some (rather crude) timings using a slightly corrected version of the code I posted elsewhere in this thread:

  1.         100 records : < 1 second
  2.       1,000 records : < 1 second
  3.      10,000 records : < 2 seconds
  4.     100,000 records : < 8 seconds
  5.   1,000,000 records : < 64 seconds
  6.  10,000,000 records : < 12m 24s

I haven't got an accurate timing, but it did 1 billion records (32 GB) from/to compressed files in around 4 1/2 hours. Maximum memory used by any run is under 4 MB.

That said, these were just a single pass, and as I pointed out elsewhere, you would probably need at least two runs to achieve a reasonable randomisation. Yours is much better in that respect I think--especially if the sort algorithm used is an unstable one. Strange to find an application that benefits from that.

The OP also had the requirement to remove duplicate records from the input, which my approach won't achieve. That said, it would require two passes through the sort utility as well, wouldn't it? One to remove the duplicates before prepending the random numbers, and then a second to re-sort--so that would balance out.

I did wonder whether instead of prepending a random number, sorting and then trimming, you could reverse the numbers, sort and re-reverse. That would randomise them pretty well for a one shot deal, but it isn't re-usable.
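A rough sketch of that reverse/sort/re-reverse idea (in memory here purely for illustration--for files of the sizes being discussed the middle step would have to be the external sort utility):

#! perl -lw
use strict;

## Reverse each fixed-width record, sort on the reversed text,
## then reverse back. Sorting on the low-order digits first is
## what scatters the records.
my @records   = map { sprintf '%030d', $_ } 1 .. 20;
my @scrambled = map { scalar reverse $_ }
                sort
                map { scalar reverse $_ } @records;
print for @scrambled;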

Sorting on a randomly chosen character position in the records might also be an option. Saves prepending and cutting.
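Again only a sketch, with an arbitrary key length and the same toy records as above; with real data any position carries some variation, but for these toy numbers the offset is restricted to the tail where the digits actually differ:

#! perl -lw
use strict;

my @records = map { sprintf '%030d', $_ } 1 .. 20;

## Compare the records on a short substring starting at a
## randomly chosen position within the fixed-width record.
my $offset    = 25 + int rand 5;   ## only the tail varies in this toy data
my @scrambled = sort {
    substr( $a, $offset, 5 ) cmp substr( $b, $offset, 5 )
} @records;
print for @scrambled;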

Like all things, there are always several ways; which is better often varies with the volumes involved, the tools available, etc.

The code and timings.

#! perl -slw
use strict;
use List::Util qw[ shuffle ];

## Omit the glob if your shell expands wildcards for you.
BEGIN{ @ARGV = shuffle map glob, @ARGV }

print ~~localtime;

## Distribute the input lines at random across 100 temp files.
my @temps;
open $temps[ $_ ], '> :raw', "tmp/$_.tmp" for 0 .. 99;

while( <> ) {
    print { $temps[ rand 100 ] } $_;
}

print 'Temp files written';

## Re-open the temp files for reading.
do{
    close $temps[ $_ ];
    open $temps[ $_ ], '< :raw', "tmp/$_.tmp" or die "tmp/$_.tmp : $!";
} for 0 .. 99;

## Interleave the temp files at random into the final output.
open FINAL, '> :raw', 'randomised.all' or die $!;

while( @temps ) {
    my $pick = int rand @temps;
    print FINAL scalar readline( $temps[ $pick ] );

    if( eof $temps[ $pick ] ) {
        close $temps[ $pick ];
        undef $temps[ $pick ];
        splice @temps, $pick, 1;
    }
}

close FINAL;
print ~~localtime;

__END__
P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 100 > junk

P:\test>389660 junk
Wed Sep 15 14:25:59 2004
Temp files written
Wed Sep 15 14:25:59 2004

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 1000 > junk

P:\test>389660 junk
Wed Sep 15 14:26:08 2004
Temp files written
Wed Sep 15 14:26:08 2004

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 10000 > junk

P:\test>389660 junk
Wed Sep 15 14:26:17 2004
Temp files written
Wed Sep 15 14:26:18 2004

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 100000 > junk

P:\test>389660 junk
Wed Sep 15 14:26:26 2004
Temp files written
Wed Sep 15 14:26:33 2004

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 1000000 > junk

P:\test>389660 junk
Wed Sep 15 14:27:02 2004
Temp files written
Wed Sep 15 14:28:05 2004

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 20000000 > junk
Terminating on signal SIGINT(2)

P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 10000000 > junk

P:\test>389660 junk
Wed Sep 15 14:30:17 2004
Temp files written
Wed Sep 15 14:42:40 2004

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^7: Strategy for randomizing large files via sysseek
by ketema (Scribe) on Sep 15, 2004 at 18:40 UTC
    I guess the size of your file is the limiting factor. I wrote the code below, and got much better times than the sarcastic guy, but still much too slow for you I think. I based mine on sorting the indexes only, not the entire content of the line. However, this does not address your problem of also removing duplicates. I'd be interested in testing what I can come up with for your data file. If you want to post it somewhere let me know, and I'll see if I can beat your 4 hour mark.
    use strict;
    use Tie::File;
    use Benchmark::Timer;

    our $N ||= 100000;

    sub shuffle {
        my $shuffled = [];
        my $numlines = scalar @{ $_[0] };

        while( $numlines > 0 ) {
            my $randomLineNum = int rand $numlines;
            ## Move the chosen index from the source list to the
            ## shuffled list so it cannot be picked again.
            push @$shuffled, splice( @{ $_[0] }, $randomLineNum, 1 );
            $numlines--;
        }
        return $shuffled;
    }

    ## Build a fixed-width test file.
    open OUT, '>', 'junk.dat' or die $!;
    printf OUT "%030d\n", $_ for 0 .. $N;
    close OUT;

    my @lines;
    tie @lines, 'Tie::File', 'junk.dat';

    my @indexList = (0..scalar @lines);

    my $T = Benchmark::Timer->new;

    $T->start( "shuffling $N lines" );
    my $newOrder = shuffle \@indexList;
    $T->stop( "shuffling $N lines" );

    $T->start( 'Writing New ordered File' );
    open NEW, '>', 'sortedJunk.dat' or die $!;
    foreach my $lineNum ( @$newOrder ) {
        print NEW $lines[ $lineNum ], "\n";
    }
    close NEW;
    $T->stop( 'Writing New ordered File' );

    $T->report();

    and my results are pretty fast:
    numLines    shuffle time      writing time
         100    281us total       15.222ms total
        1000    1.480ms total     57.048ms total
       10000    29.017ms total    559.466ms total
      100000    1.922s total      5.969s total
     1000000    256.212s total    154.443s total

      Okay. You've avoided the major slowdown by not using Tie::File for both reading and writing--but there are still a couple of problems with your approach.

      1. The first is that your shuffle is memory-based. That means that it will not scale beyond the point where your array of indexes is larger than memory.

        You probably think that by using an array of indices rather than an array of lines, you are saving large amounts of memory. This is not the case.

        A perl array uses at least 24 bytes of memory for every element of the array before you actually store anything in those elements. So, your array of indexes saves very little memory compared to storing the lines of the file.

        In the OP's case of at least 20 million lines, that is at least 20 million indexes x 24 bytes = 480 MB.

      2. The second problem is your use of splice in your shuffle. As was brought home to me very clearly by Abigail in his post Be aware of splice, using splice in a shuffle endows it with quadratic performance at the C level. Being in C, it doesn't show up when dealing with smallish numbers (< 100_000), because the C code is so fast relative to the perl code that calls it; but once the numbers move above 100_000, that quadratic performance really begins to bite. See Abigail's post for the full SP on that one. (A splice-free, swap-based shuffle is sketched just below.)
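
      As a minimal sketch of that splice-free alternative--the standard swap-based Fisher-Yates shuffle done in place (the sub name here is just for illustration; List::Util's shuffle does much the same job for you):

      sub fisher_yates_shuffle {
          ## Swap each element with a randomly chosen earlier one:
          ## O(n) time, constant extra memory, and no splice.
          my $array = shift;
          for my $i ( reverse 1 .. $#$array ) {
              my $j = int rand( $i + 1 );
              @{ $array }[ $i, $j ] = @{ $array }[ $j, $i ];
          }
          return $array;
      }

      ## Usage: shuffle the index list in place.
      ## fisher_yates_shuffle( \@indexList );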
      I'd be interested in testing what I can come up with for your data file. If you want to post it somewhere let me know, and I'll see if I can beat your 4 hour mark.

      The data file I used for the 1 billion lines test was a simple extension of that used above:

      my $n=0; printf "%030d\n", $n++ while $n < 1E9;

      To test your code, just substitute that line into your program above; but be prepared for a looong wait :)

      The first long wait will come when the line:

      my @indexList = (0..scalar @lines);

      is executed. In order for Tie::File to determine the size of the array (scalar @lines), it has to read every line in the file. Whilst wc -l will do this for a 32 GB file in, say, 10 minutes, it will take Tie::File a great deal longer. This is because, as well as reading every line of the file sequentially and discarding it--as wc -l would--it also builds a hash of the offset of the start of each line.

      Each element of that hash costs at least 21 bytes per line; for 1 billion lines that amounts to at least 19.5 GB!
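
      For contrast, a minimal sketch of a wc -l style count in plain Perl: one sequential pass with a fixed-size buffer and no per-line bookkeeping (the 1 MB buffer size is an arbitrary choice):

      #! perl -lw
      use strict;

      open my $fh, '<:raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

      my( $buffer, $count ) = ( '', 0 );

      ## Read big fixed-size chunks and count the newlines in each;
      ## memory use stays constant however large the file is.
      while( sysread $fh, $buffer, 1024 * 1024 ) {
          $count += $buffer =~ tr/\n//;
      }
      close $fh;

      print "$count lines";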


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon