in reply to Strategy for randomizing large files via sysseek
Do you have to shuffle all of the files together into one combined, randomised output, or is it more a case of randomising each file on its own:

    File1 --> Randomise --> File1'
    File2 --> Randomise --> File2'
     .
     .
    Filen --> Randomise --> Filen'
How fast can the script that uses the randomised data actually consume it? Do you really have to rewrite the entire files? Perhaps you can provide a script that returns random lines from them to STDOUT (and keeps a record so it never gives the same line twice):

    File1 --+                +--> Filea
    File2 --+-> Randomise ---+--> Fileb
     ....   |                |    ....
    Filen --+                +--> Filez
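A minimal sketch of that idea in Perl, under a couple of assumptions of mine: a single hypothetical source file data.txt, and plain buffered seek/readline rather than sysseek/sysread, for brevity. It jumps to a random byte offset, discards the partial line it lands in, remembers the start offset of the next full line so it is never handed out twice, and prints it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'data.txt';                 # hypothetical name for one source file
open my $fh, '<', $file or die "open $file: $!";
my $size = -s $fh;                     # total bytes in the file

my %seen;                              # start offsets of lines already printed

# Pick one not-yet-used line at random, or return undef after too many misses.
# Note: this is mildly biased (lines after long lines are favoured) and it
# can never return the very first line of the file.
sub random_line {
    for (1 .. 1_000) {
        seek $fh, int(rand $size), 0 or die "seek: $!";
        <$fh>;                         # discard the (probably partial) line we landed in
        my $start = tell $fh;
        next if $start >= $size;       # landed inside the last line; try again
        next if $seen{$start}++;       # already handed this line out
        return scalar <$fh>;
    }
    return;                            # most lines used up, or very unlucky
}

# Example: emit ten random, non-repeating lines to STDOUT.
for (1 .. 10) {
    defined(my $line = random_line()) or last;
    print $line;
}
```

The %seen hash here is exactly the growing in-memory record discussed below.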
I guess your increasing memory usage comes from the record of which lines you have already output. You could return some of the lines (perhaps 10%), then rewrite the source files without the used-up lines, and rinse and repeat, as sketched below.
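That rewrite pass might look something like this (a sketch only, reusing $file and the %seen offset record from the snippet above; the temp-file name and the rename step are my own choices):

```perl
# Rewrite $file keeping only lines whose start offset is not in %seen,
# then start over with an empty record.
sub purge_used_lines {
    open my $in,  '<', $file       or die "open $file: $!";
    open my $out, '>', "$file.tmp" or die "open $file.tmp: $!";
    while (1) {
        my $start = tell $in;
        defined(my $line = <$in>) or last;
        print {$out} $line unless $seen{$start};
    }
    close $out or die "close $file.tmp: $!";
    rename "$file.tmp", $file or die "rename: $!";
    %seen = ();                    # every remaining line is unused again
    # NB: re-open $fh and refresh $size in the picker after this runs.
}
```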
Alternatively, you could clobber the start of each line you have used with some unique token such as XXX-Clobbered-XXX; then you do not need to keep the used-line record in RAM at all. Of course, once you get past about 50% used, a random pick will hit a clobbered line more often than not, so a rewrite of the source data that purges the used lines would still be required at some point (see the sketch below).
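A sketch of that clobbering variant, again with my own assumptions: the file is opened read-write, and no line is shorter than the marker (otherwise the stamp spills into the next line). Beware, too, that writing through this handle while reading the same file through a separate buffered handle can leave stale data in the read buffer:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $file   = 'data.txt';            # hypothetical source file
my $marker = 'XXX-Clobbered-XXX';   # must not be longer than the shortest line

open my $rw, '+<', $file or die "open $file: $!";

# Stamp the line starting at byte offset $start as used, in place on disk.
sub clobber_line {
    my ($start) = @_;
    sysseek  $rw, $start, SEEK_SET or die "sysseek: $!";
    syswrite $rw, $marker          or die "syswrite: $!";
}

# In the picker, the "already used" test then becomes a string compare
# instead of a hash lookup:
#   next if substr($line, 0, length $marker) eq $marker;
```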
You could put in more RAM or a faster disk :)
Cheers,
Replies are listed 'Best First'.

Re^2: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 09, 2004 at 14:39 UTC
by Random_Walk (Prior) on Sep 09, 2004 at 15:18 UTC
by Anonymous Monk on Sep 09, 2004 at 17:32 UTC
by ketema (Scribe) on Sep 14, 2004 at 16:19 UTC
by BrowserUk (Patriarch) on Sep 14, 2004 at 17:18 UTC
by Anonymous Monk on Sep 15, 2004 at 13:02 UTC
by BrowserUk (Patriarch) on Sep 15, 2004 at 14:08 UTC