in reply to Randomizing Big Files

Why do you say that a database will be too slow? Did you try it?

If you load the lines of your file into the database and assign a (sufficiently large) random number to an indexed extra field, you can sort on that field and you will have effectively shuffled your large file.
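As a minimal sketch of the idea (using SQLite through Python's sqlite3 module; the table and column names are illustrative, and an in-memory database stands in for a file-backed one):

```python
import random
import sqlite3

# Shuffle the lines of a "file" by loading them into a table with an
# indexed random sort key, then reading them back ordered by that key.
conn = sqlite3.connect(":memory:")  # a real run would use an on-disk DB
conn.execute("CREATE TABLE lines (line TEXT, sortkey INTEGER)")
conn.execute("CREATE INDEX idx_sortkey ON lines (sortkey)")

lines = [f"record {i}" for i in range(10)]  # stands in for the big file
conn.executemany(
    "INSERT INTO lines VALUES (?, ?)",
    ((ln, random.randrange(2**31)) for ln in lines),
)

# Reading in sortkey order yields the lines in shuffled order.
shuffled = [row[0] for row in conn.execute("SELECT line FROM lines ORDER BY sortkey")]
```

The index on the sort key lets the database do the "sort" cheaply; writing `shuffled` back out line by line completes the shuffle.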

Databases are specifically optimized to do such things, so why re-invent the wheel?

If you have to reshuffle the database, all you have to do is update the random-number field with fresh random values: one simple SQL statement will do the trick.
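For instance, in SQLite (again via Python's sqlite3, with an illustrative table set up inline) a single UPDATE reshuffles every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (line TEXT, sortkey INTEGER)")
conn.executemany(
    "INSERT INTO lines VALUES (?, 0)",
    [(f"record {i}",) for i in range(5)],
)

# One statement gives every row a fresh random sort key; SQLite's built-in
# RANDOM() returns a signed 64-bit integer, hence the ABS().
conn.execute("UPDATE lines SET sortkey = ABS(RANDOM())")

rows = list(conn.execute("SELECT line FROM lines ORDER BY sortkey"))
```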

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Replies are listed 'Best First'.
Re^2: Randomizing Big Files
by Anonymous Monk on Jan 26, 2005 at 15:54 UTC
    Do you know how long it takes to insert 4 GB of data into a DB, when those 4 GB hold 150,000,000 entries, and then to export that data again? Tooo muuuchhh! Now add the insertion of a random unique ID: for each random ID to insert we need to check whether it is already in use, which needs an index. Generating a new SORTED unique ID in a DB is just MAX_ID+1, which is simple, while a RANDOM unique ID is while( !indexed( rand(N) ) ).
      That would indeed take very long, but reading random lines (or, heaven forbid, single words!) from a 4 GB file would most probably take even longer.

      And you don't have to give each record a unique random number: some collisions are acceptable and do not harm the randomness. Say you have 150 million items; then a random number of at most 10 million would lead, on average, to 15 items sharing the same number, but those 15 "same"-numbered items would come from randomly different places in your database, so that does not hurt and there is no need to check whether a number is already in use.
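      The arithmetic behind that claim can be checked with a scaled-down simulation (150 thousand items into 10 thousand key values instead of 150 million into 10 million; the ratio, and hence the expected bucket size of 15, is the same):

```python
import random
from collections import Counter

n_items, key_range = 150_000, 10_000  # scaled down from 150 M items / 10 M keys

# Draw a random (non-unique) key for every item and count collisions.
keys = [random.randrange(key_range) for _ in range(n_items)]
counts = Counter(keys)

# Expected number of items sharing each key value: n_items / key_range = 15.
avg = n_items / key_range
```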

      Keeping a list of the positions of all 150 million items in an array (which, at 24 bytes per item plus the bytes needed to store each value, would flood all but the largest computers) would slow your computer to a crawl.

      The concept of "slow" is relative: even something "slow" can be fast if all other options are even slower!

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law