in reply to Re^2: Randomizing Big Files
in thread Randomizing Big Files
And you don't have to give each record a unique random number; some collisions are acceptable and do not harm the randomness. Say you have 150 million items: random numbers of at most 10 million would give, on average, 15 items sharing the same number, but those 15 "same"-numbered items would come from randomly different places in your database, so that does no harm and there is no need to check whether a number is already in use.
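A minimal sketch of the idea, assuming an in-memory toy data set (a real 150-million-record file would use an external sort, and the names here are hypothetical): tag each record with a random number whose range is deliberately smaller than the record count, so collisions must occur, then sort by the tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tag range is deliberately small relative to the record count, so
# several records will share a tag -- as argued above, that is fine,
# because colliding records come from random positions anyway.
my $max_tag = 5;
my @records = map { "record$_" } 1 .. 20;

# Schwartzian transform: decorate with a random tag, sort by it,
# then strip the tag off again.
my @shuffled =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ int( rand($max_tag) ), $_ ] } @records;

print "$_\n" for @shuffled;
```

Records that happen to share a tag simply end up adjacent to each other; no uniqueness check is ever needed.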
Keeping a list of all positions of your 150 million items in an array (which, at 24 bytes per item plus the bytes needed to store each value, would flood all but the largest computers) would slow your machine to a crawl.
The concept of "slow" is relative: even something "slow" can be fast if all other options are even slower!
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law