in reply to Strategy for randomizing large files via sysseek

I had a situation where a huge file needed to be sorted, so I used the heap sort described by BrowserUK above. One difference is that I used a hash function to decide which smaller file each line should go to, which gives a more balanced distribution across the smaller files. Something like the sketch below.
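A minimal sketch of that split step, assuming newline-delimited records; the file names, bucket count, and the unpack-checksum choice are mine for illustration, not what I actually ran:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $nbuckets = 64;                 # illustrative bucket count

    # open one output handle per bucket
    my @fh;
    for my $i (0 .. $nbuckets - 1) {
        open $fh[$i], '>', "bucket.$i" or die "bucket.$i: $!";
    }

    open my $in, '<', 'huge.dat' or die "huge.dat: $!";
    while (my $line = <$in>) {
        # cheap 32-bit checksum of the line's bytes; modulo picks the bucket,
        # so identical lines always land in the same file
        my $b = unpack('%32C*', $line) % $nbuckets;
        print { $fh[$b] } $line;
    }
    close $_ for @fh, $in;

Each bucket can then be sorted in memory on its own, and because the hash spreads lines roughly evenly, no single bucket ends up much larger than the rest.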

I also ran into the memory problem, and I never really figured out why my script was using that much memory. My workaround was to have the script re-invoke itself via backticks/system(), so each smaller file is processed in a child process whose memory is returned to the OS when it exits.
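The driver loop looks roughly like this; the --sort-one flag is a hypothetical option the script would recognize in itself, and $^X/$0 are the running perl binary and script name:

    # re-invoke this same script on each bucket; the child's memory is
    # freed when the child exits, so the parent stays small
    my $nbuckets = 64;                 # must match the split step
    for my $i (0 .. $nbuckets - 1) {
        system($^X, $0, '--sort-one', "bucket.$i") == 0
            or die "child failed on bucket.$i: $?";
    }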

Depending on how much randomness you need, you can use a similar approach: randomly split the large file into smaller files, then concatenate them in random order into a new large file, and repeat a few times. A single pass only shuffles at the granularity of the pieces (lines that land in the same piece keep their relative order), which is why repeating helps. Use backticks/system() to free the memory between passes. One pass might look like the sketch below.
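A sketch of one split-and-recombine pass, assuming the file fits on disk twice over; the names big.dat and piece.N are invented for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util 'shuffle';

    my ($src, $npieces) = ('big.dat', 32);

    # deal each line to a randomly chosen piece
    my @fh;
    for my $i (0 .. $npieces - 1) {
        open $fh[$i], '>', "piece.$i" or die "piece.$i: $!";
    }
    open my $in, '<', $src or die "$src: $!";
    while (my $line = <$in>) {
        print { $fh[int rand $npieces] } $line;
    }
    close $_ for @fh, $in;

    # stitch the pieces back together in a random order
    open my $out, '>', "$src.shuffled" or die "$src.shuffled: $!";
    for my $i (shuffle 0 .. $npieces - 1) {
        open my $p, '<', "piece.$i" or die "piece.$i: $!";
        print {$out} $_ while <$p>;
        close $p;
        unlink "piece.$i";
    }
    close $out;

Running the whole script again on $src.shuffled gives you the next pass.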
