Re^3: Strategy for randomizing large files via sysseek

I think you missed my point on why Tie::File would be good for you. I see you found your solution already, so this is after the fact, but here you go anyway.
Tie::File allows you to take a large file and treat it as an array without pulling the entire file into memory. If I understand you correctly, you need to make new files from the randomized lines of some really huge file, without messing up the order of the original huge file.
What could be easier than an array for this? As soon as you attach to your large file with Tie::File you instantly know the total number of lines. It is trivial to have perl return you a random index between 0 and one less than the total number of lines you have. Then simply write that line to your new file, and keep track of the line number (array element) because you don't want to use it again. Now the trick is in optimizing your random number generator function to return a "unique" random number each time, because you don't want to sit there waiting for a line you haven't used yet, especially as over time the number of lines used outnumbers the unused ones.
My suggestion is that you create yourself a "Random Order List". Since you know the total number of lines, you effectively have a sequence of numbers. All you really want to do is jumble that sequence up, then write out the now random list "in order"; this is typically called a "shuffle". Let's look at some code:

  # @lines is the old, unshuffled list of indexes into your Tie::File array
  my @lines = 0 .. 51;
  # for you the sequence would be 0 .. $#YourTieArray
  my @shuffled = ();    # an empty array
  my $numlines = 52;    # scalar @YourTieArray again
  while ($numlines > 0) {
    # pull one randomly chosen index out of @lines and append it to @shuffled
    push(@shuffled, splice(@lines, int(rand $numlines), 1));
    $numlines--;
  }
  # @shuffled contains the shuffled indexes, @lines is empty

Now my thinking says that shuffling a list of integers, even if that list is big, will take less memory and less time than having every line reprinted with a random token in front of it, because once you have your shuffled list of array indexes, all you have to do is loop through it and print the corresponding line from your Tie::File array.
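As a rough sketch of that last step (shuffle the indexes, then print the matching lines), something like the following could work. The sub name write_shuffled and the file names are made up for illustration, and it swaps the splice loop above for List::Util's shuffle. The file is tied with mode => O_RDONLY so the original cannot be changed by accident.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;
use List::Util qw(shuffle);

# Hypothetical helper: write the lines of $in to $out in random order,
# leaving $in untouched. Returns the number of lines written.
sub write_shuffled {
    my ($in, $out) = @_;

    # Read-only tie, so nothing here can modify the original file.
    tie my @lines, 'Tie::File', $in, mode => O_RDONLY
        or die "Cannot tie $in: $!";

    # Shuffle the indexes only; the line data itself stays on disk.
    my @order = shuffle 0 .. $#lines;

    open my $fh, '>', $out or die "Cannot open $out: $!";
    # Tie::File returns records without their newline, so add it back.
    print {$fh} $lines[$_], "\n" for @order;
    close $fh or die "Cannot close $out: $!";

    untie @lines;
    return scalar @order;
}
```

Called as write_shuffled('huge.txt', 'shuffled.txt'), only the index list lives in memory, but every line is still fetched through Tie::File in random order, so this says nothing about how fast it will be on a very large file.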
A discussion of shuffling algorithms that contributed to my comment can be found here: http://c2.com/cgi/wiki?LinearShuffle
If you read this, let me know whether it made a time improvement or not.
Ketema


Re^4: Strategy for randomizing large files via sysseek by BrowserUk (Patriarch) on Sep 14, 2004 at 17:18 UTC
Now try it on a file with a million lines...

Update: Apparently, this is seen as sarcasm or otherwise unhelpful; so here are some statistics.

Using Tie::File and an in-place Fisher-Yates shuffle to shuffle files of various sizes:

  100 lines: 58 milliseconds
  1,000 lines: 2 seconds
  10,000 lines: 194 seconds
  100,000 lines: after 3 1/2 hours of CPU I got sick of listening to the fan thrashing itself to death trying to keep the CPU cool, and aborted.
  20,000,000 lines (the OP's task): Probably best measured in half-lives of Plutonium.

The test code, should anyone wish to verify my figures:

  #! perl -slw
  use strict;
  use Tie::File;
  use Benchmark::Timer;

  # Number of lines; set from the command line with -N=... (via -s)
  our $N ||= 1000;

  # In-place Fisher-Yates shuffle of an array ref (or of a copy of a list)
  sub shuffle {
      my $ref = @_ == 1 ? $_[ 0 ] : [ @_ ];
      for( 0 .. $#$ref ) {
          # pick a random position from the not yet shuffled tail and swap
          my $p = $_ + rand( @{ $ref } - $_ );
          @{ $ref }[ $_, $p ] = @{ $ref }[ $p, $_ ];
      }
      return unless defined wantarray;
      return wantarray ? @{ $ref } : $ref;
  }

  # Build a test file of fixed-width lines
  open OUT, '>', 'junk.dat' or die $!;
  printf OUT "%030d\n", $_ for 0 .. $N;
  close OUT;

  my @lines;
  tie @lines, 'Tie::File', 'junk.dat';

  # Time the shuffle of the tied array
  my $T = new Benchmark::Timer;
  $T->start( "shuffle $N" );
  shuffle \@lines;
  $T->stop( "shuffle $N" );
  $T->report;

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^5: Strategy for randomizing large files via sysseek by Anonymous Monk on Sep 15, 2004 at 13:02 UTC
I did a

  time perl -pe '$r = substr(rand(), 2); $_ = "$r\t$_"' input | sort -n | cut -f 2- > /dev/null

on the same files, and my results are:

  nr of lines   real          user          sys
  100           0m 0.010s     0m 0.010s     0m 0.000s
  1000          0m 0.051s     0m 0.010s     0m 0.030s
  10000         0m 0.264s     0m 0.180s     0m 0.040s
  100000        0m 2.608s     0m 1.640s     0m 0.140s
  1000000       0m40.640s     0m17.550s     0m 1.060s
  10000000      17m14.639s    3m30.830s     0m27.200s
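For anyone who wants to see the idea behind that pipeline without the shell, here is a minimal in-memory sketch of the same three steps in Perl: attach a random key to each line, sort on the key, then strip the key again. The sub name shuffle_by_random_key is made up for illustration, and unlike the external sort it keeps every line in memory, so it is only meant to show the technique, not to handle the file sizes discussed here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the given lines in random order by
# sorting them on a random key attached to each line.
sub shuffle_by_random_key {
    my @lines = @_;
    return map  { $_->[1] }                 # strip the key (the "cut" step)
           sort { $a->[0] <=> $b->[0] }     # order by key (the "sort -n" step)
           map  { [ rand(), $_ ] } @lines;  # attach a key (the "perl -pe" step)
}
```

Something like print shuffle_by_random_key(<STDIN>); would then shuffle standard input, as long as the whole input fits in memory.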
Re^6: Strategy for randomizing large files via sysseek by BrowserUk (Patriarch) on Sep 15, 2004 at 14:08 UTC
The post to which you've responded was meant to warn against using Tie::File for this kind of random read/write access to very large files. For what it is good at, Tie::File is a brilliant module; I use it all the time, but this isn't what it is good at.

In terms of your timings, they are pretty good. Here are some (rather crude) timings using a slightly corrected version of the code I posted elsewhere in this thread:

  100           < 1 second
  1,000         < 1 second
  10,000        < 2 seconds
  100,000       < 8 seconds
  1,000,000     < 64 seconds
  10,000,000    < 12m24secs

I haven't got an accurate timing, but it did 1 billion records (32 GB) from/to compressed files in around 4 1/2 hours. Maximum memory used by any run is under 4 MB.

That said, these were just a single pass, and as I pointed out elsewhere, you would probably need at least two runs to achieve a reasonable randomisation. Yours is much better in that respect I think, especially if the sort algorithm used is an unstable one. Strange to find an application that benefits from that.

The OP also had the requirement to remove duplicate records from the input, which my approach won't achieve. That said, it would require two passes through the sort utility as well, wouldn't it? One to remove the duplicates before prepending the random numbers, and then re-sorting, so that would balance out.

I did wonder whether, instead of prepending a random number, sorting and then trimming, you could reverse the numbers, sort and re-reverse. That would randomise them pretty well for a one-shot deal, but it isn't re-usable. Sorting on a randomly chosen character position in the records might also be an option. Saves prepending and cutting.

Like all things, there are always several ways; which is better often varies with the volumes involved, the tools available, etc.
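On the duplicate-removal requirement, one alternative to an extra pass through sort is a single pass with a hash of lines already seen. This is only a sketch under the assumption that all distinct lines fit in memory as hash keys, which may well not hold for an input of the size discussed in this thread; the sub name drop_duplicates is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in_fh to $out_fh, keeping only
# the first occurrence of each line. Returns the number of lines kept.
# Assumes every distinct line fits in memory as a hash key.
sub drop_duplicates {
    my ($in_fh, $out_fh) = @_;
    my %seen;
    my $kept = 0;
    while (my $line = <$in_fh>) {
        next if $seen{$line}++;    # skip lines we have already written
        print {$out_fh} $line;
        $kept++;
    }
    return $kept;
}
```

The output keeps the original order of first occurrences, so it can be fed straight into whichever shuffling approach is used afterwards.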
Re^7: Strategy for randomizing large files via sysseek by ketema (Scribe) on Sep 15, 2004 at 18:40 UTC
Re^8: Strategy for randomizing large files via sysseek by BrowserUk (Patriarch) on Sep 16, 2004 at 08:34 UTC