Do you have to

    File1 --> Randomise --> File1'
    File2 --> Randomise --> File2'
    .
    .
    Filen --> Randomise --> Filen'

or is it more a case of

    File1 --+              + Filea
    File2 --+-> Randomise -+ Fileb
    ....    |              | ....
    Filen --+              + Filez

How fast can the script that consumes the randomised data actually use it? Do you really have to rewrite the entire files? Perhaps you can provide a script that returns random lines from them to STDOUT (and keeps a record so it does not give out the same line twice), something along the lines of the sketch below.
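Here is a minimal Perl sketch of that approach for a single file; the filename is illustrative, and it assumes plain newline-delimited records. It indexes the byte offset of each line once, then hands out lines in random order, remembering which ones have already gone out.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Index the start offset of every line, then emit the lines in a
    # random order, remembering which indices were already handed out.
    my $file = 'File1';                  # illustrative name
    open my $fh, '<', $file or die "open $file: $!";

    my @offset = (0);
    push @offset, tell $fh while <$fh>;
    pop @offset;                         # last entry is EOF, not a line start

    my %used;                            # record of lines already given out
    while (keys %used < @offset) {
        my $i = int rand @offset;
        next if $used{$i}++;             # already given out, pick again
        seek $fh, $offset[$i], 0;
        print scalar <$fh>;
    }
    close $fh;

The %used hash is that record; note it only needs one entry per line handed out, not the line itself.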
I guess your increasing memory usage comes from the record of which lines you have already output. You could return some of the lines (perhaps 10%), then rewrite the source files without the used-up lines, and rinse and repeat.
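Something like this, per file and per round; the filename and the 10% figure are illustrative, and there is no locking or error recovery beyond the bare minimum.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One "hand out ~10%, then purge" round for a single file.
    my $file     = 'File1';              # illustrative name
    my $fraction = 0.10;                 # hand out roughly 10% per round

    # Pass 1: count the lines without holding them in RAM.
    open my $in, '<', $file or die "open $file: $!";
    my $count = 0;
    $count++ while <$in>;
    close $in;

    # Pick which (0-based) line numbers go out this round.
    my %pick;
    $pick{ int rand $count } = 1 while keys %pick < int($fraction * $count);

    # Pass 2: print the picked lines, copy the survivors to a new file,
    # then swap it in so the used-up lines really are gone.
    open $in, '<', $file or die "open $file: $!";
    open my $out, '>', "$file.new" or die "open $file.new: $!";
    while (my $line = <$in>) {
        if   ( $pick{ $. - 1 } ) { print $line }
        else                     { print {$out} $line }
    }
    close $in;
    close $out or die "close $file.new: $!";
    rename "$file.new", $file or die "rename: $!";

Within a round the picked lines come back in file order, so shuffle them as well if that matters to the consumer.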
You could clobber the start of each line you have used with some unique token such as XXX-Clobbered-XXX; then you do not need to store the used-lines record in RAM. Of course, once you get past 50% your random pick will hit a clobbered line more often than not, so a rewrite of the source data purging the used lines would still be required at some point.
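A sketch of that clobbering idea, with loud assumptions: the filename and token are illustrative, every line is assumed to be longer than the token (so the trailing newline, and hence the line boundaries, survive the overwrite), and the random-byte-seek trick used here favours long lines and never picks the very first line.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Seek to a random byte, take the next whole line, hand it out, then
    # overwrite the start of that line with the token so it is never
    # given out again. No record of used lines is kept in RAM.
    my $file  = 'File1';                 # illustrative name
    my $token = 'XXX-Clobbered-XXX';

    open my $fh, '+<', $file or die "open $file: $!";
    my $size = -s $fh;

    for (1 .. 10) {                      # hand out ten lines, say
        seek $fh, int rand $size, 0;
        <$fh>;                           # discard the partial line we landed in
        my $start = tell $fh;
        my $line  = <$fh>;
        redo unless defined $line;       # fell off the end of the file, retry
        redo if index($line, $token) == 0;   # already clobbered, retry

        print $line;                     # hand the line out

        seek $fh, $start, 0;             # go back and clobber its start
        print {$fh} $token;
    }
    close $fh;

The redo retries are the price of keeping no record in RAM, and they are what eventually makes the purge rewrite necessary.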
You could put in more RAM or a faster disk :)
Cheers,