in reply to Strategy for randomizing large files via sysseek
Do you have to shuffle all of the files together into one combined, randomised output, or is it more a case of randomising each file on its own:

    File1 --> Randomise --> File1'
    File2 --> Randomise --> File2'
     .
     .
    Filen --> Randomise --> Filen'
How fast can the script that uses the randomised data actually consume it? Do you really have to rewrite the entire files? Perhaps you can provide a script that returns random lines from them to STDOUT (and keeps a record so it never gives the same line twice):

    File1 --+                +--> Filea
    File2 --+-> Randomise ---+--> Fileb
     ....   |                |    ....
    Filen --+                +--> Filez
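A minimal sketch of that idea in Perl, under a couple of assumptions of mine: a single hypothetical source file data.txt, and plain buffered seek/readline rather than sysseek/sysread, for brevity. It jumps to a random byte offset, discards the partial line it lands in, remembers the start offset of the next full line so it is never handed out twice, and prints it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'data.txt';                 # hypothetical name for one source file
open my $fh, '<', $file or die "open $file: $!";
my $size = -s $fh;                     # total bytes in the file

my %seen;                              # start offsets of lines already printed

# Pick one not-yet-used line at random, or return undef after too many misses.
# Note: this is mildly biased (lines after long lines are favoured) and it
# can never return the very first line of the file.
sub random_line {
    for (1 .. 1_000) {
        seek $fh, int(rand $size), 0 or die "seek: $!";
        <$fh>;                         # discard the (probably partial) line we landed in
        my $start = tell $fh;
        next if $start >= $size;       # landed inside the last line; try again
        next if $seen{$start}++;       # already handed this line out
        return scalar <$fh>;
    }
    return;                            # most lines used up, or very unlucky
}

# Example: emit ten random, non-repeating lines to STDOUT.
for (1 .. 10) {
    defined(my $line = random_line()) or last;
    print $line;
}
```

The %seen hash here is exactly the growing in-memory record discussed below.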
I guess your increasing memory usage comes from the record of which lines you have already output. You could return some of the lines (perhaps 10%), then rewrite the source files without the used-up lines, and rinse and repeat, as sketched below.
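That rewrite pass might look something like this (a sketch only, reusing $file and the %seen offset record from the snippet above; the temp-file name and the rename step are my own choices):

```perl
# Rewrite $file keeping only lines whose start offset is not in %seen,
# then start over with an empty record.
sub purge_used_lines {
    open my $in,  '<', $file       or die "open $file: $!";
    open my $out, '>', "$file.tmp" or die "open $file.tmp: $!";
    while (1) {
        my $start = tell $in;
        defined(my $line = <$in>) or last;
        print {$out} $line unless $seen{$start};
    }
    close $out or die "close $file.tmp: $!";
    rename "$file.tmp", $file or die "rename: $!";
    %seen = ();                    # every remaining line is unused again
    # NB: re-open $fh and refresh $size in the picker after this runs.
}
```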
Alternatively, you could clobber the start of each line you have used with some unique token such as XXX-Clobbered-XXX; then you do not need to keep the used-line record in RAM at all. Of course, once you get past about 50% used, a random pick will hit a clobbered line more often than not, so a rewrite of the source data that purges the used lines would still be required at some point (see the sketch below).
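A sketch of that clobbering variant, again with my own assumptions: the file is opened read-write, and no line is shorter than the marker (otherwise the stamp spills into the next line). Beware, too, that writing through this handle while reading the same file through a separate buffered handle can leave stale data in the read buffer:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $file   = 'data.txt';            # hypothetical source file
my $marker = 'XXX-Clobbered-XXX';   # must not be longer than the shortest line

open my $rw, '+<', $file or die "open $file: $!";

# Stamp the line starting at byte offset $start as used, in place on disk.
sub clobber_line {
    my ($start) = @_;
    sysseek  $rw, $start, SEEK_SET or die "sysseek: $!";
    syswrite $rw, $marker          or die "syswrite: $!";
}

# In the picker, the "already used" test then becomes a string compare
# instead of a hash lookup:
#   next if substr($line, 0, length $marker) eq $marker;
```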
You could put in more RAM or a faster disk :)
Cheers,
Replies are listed 'Best First'.

Re^2: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 09, 2004 at 14:39 UTC
by Random_Walk (Prior) on Sep 09, 2004 at 15:18 UTC
by Anonymous Monk on Sep 09, 2004 at 17:32 UTC
by ketema (Scribe) on Sep 14, 2004 at 16:19 UTC
by BrowserUk (Patriarch) on Sep 14, 2004 at 17:18 UTC
by Anonymous Monk on Sep 15, 2004 at 13:02 UTC
by BrowserUk (Patriarch) on Sep 15, 2004 at 14:08 UTC