in reply to Re: Randomizing Big Files
in thread Randomizing Big Files

Do you know how long it takes to insert 4 GB of data into a DB when those 4 GB hold 150,000,000 entries, and then to export that data again? Tooo muuuchhh! Now add the cost of a random unique ID: for every random ID we want to insert we have to check whether it is already in use, which requires an index. Creating a new SORTED unique ID in a DB is simple, just MAX_ID+1, but creating a RANDOM unique ID means while( !indexed( rand(N) ) ).
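
To make the contrast concrete, here is a minimal Perl sketch of the two strategies; the %in_use hash and the subroutine names are only illustrative stand-ins for a real DB index:

    use strict;
    use warnings;

    # Sorted unique ID: one addition, no lookup needed.
    sub next_sorted_id {
        my ($max_id) = @_;
        return $max_id + 1;
    }

    # Random unique ID: keep drawing until the value is unused.
    # This is the while( !indexed( rand(N) ) ) loop, with a hash
    # standing in for the DB index.
    sub next_random_id {
        my ( $in_use, $n ) = @_;
        my $id;
        do {
            $id = int rand $n;
        } while ( exists $in_use->{$id} );
        $in_use->{$id} = 1;
        return $id;
    }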

Replies are listed 'Best First'.
Re^3: Randomizing Big Files
by CountZero (Bishop) on Jan 26, 2005 at 19:23 UTC
    That would indeed take very long, but reading random lines (or heaven forbid single words!) from a 4 GB file would most probably take even longer.

    And you don't have to give each record a unique random number; some collisions are acceptable and do not harm the randomness. Say you have 150 million items and draw random numbers no larger than 10 million: on average 15 items will share the same number, but those 15 "same"-numbered items come from randomly different places in your database, so the collisions do no harm and there is no need to check whether a number is already in use.
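
    A minimal sketch of that idea, assuming DBI with an SQLite backend and an illustrative table name (records); the random key column is deliberately non-unique, so no uniqueness check is ever made:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );
        $dbh->do('CREATE TABLE IF NOT EXISTS records (rand_key INTEGER, line TEXT)');

        # Tag every record with a non-unique random key on insertion.
        my $range = 10_000_000;    # ~15 collisions per key on average
        my $ins   = $dbh->prepare('INSERT INTO records (rand_key, line) VALUES (?, ?)');
        while ( my $line = <STDIN> ) {
            $ins->execute( int rand($range), $line );
        }
        $dbh->commit;

        # Export in random order by sorting on the key.
        my $sth = $dbh->prepare('SELECT line FROM records ORDER BY rand_key');
        $sth->execute;
        while ( my ($line) = $sth->fetchrow_array ) {
            print $line;
        }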

    Keeping a list of the positions of all 150 million items in an array (which, at roughly 24 bytes of overhead per element plus the bytes needed to store each value, would flood the memory of all but the largest computers) would slow your computer down to a crawl.
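
    A back-of-the-envelope check of that claim (both the 24 bytes of per-element overhead and the 8 bytes per stored offset are assumptions):

        my $items    = 150_000_000;
        my $overhead = 24;    # assumed bytes of array overhead per element
        my $value    = 8;     # assumed bytes to store one file offset
        printf "~%.1f GB just for the position list\n",
            $items * ( $overhead + $value ) / 2**30;

    That comes to roughly 4.5 GB for the positions alone, more than the file itself.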

    The concept of "slow" is relative: even something "slow" can be fast if all other options are even slower!

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law