in reply to Randomizing Big Files

Your speed is going to be limited by your disk time, which means that the solutions that randomize all the byte offsets of the lines and then read them one by one will be murderously slow, since they'll have worst-case seek performance. Trust me, this would be really, really, really bad.

I would probably tackle it with something along these lines:

Assume you have a way of knowing the sequence number of each line (that is, the position it should occupy in the shuffled output), generated in the order the lines appear in the file. In other words, you have something that can incrementally generate a permutation of 0..N-1, where N is the number of lines.
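If you need something concrete for that, here's a minimal sketch of one way to provide it (the line count N and the 4-bytes-per-rank layout are my assumptions, not part of the problem): build the permutation up front as a packed string of 32-bit ranks and Fisher-Yates shuffle it. At 4 bytes per line, 10 million lines costs about 40MB, and the memory use is predictable.

    use strict;
    use warnings;

    my $N = 10_000_000;    # number of lines, assumed known (e.g. from a prior wc -l)

    # Packed vector of 32-bit ranks: the rank of line $k lives at byte offset 4*$k.
    my $ranks = '';
    $ranks .= pack 'L', $_ for 0 .. $N - 1;

    # Fisher-Yates shuffle of the packed vector, so the ranks end up a random
    # permutation of 0..N-1 while memory use stays at a predictable 4*N bytes.
    for ( my $i = $N - 1; $i > 0; $i-- ) {
        my $j   = int rand( $i + 1 );
        my $tmp = substr $ranks, 4 * $i, 4;
        substr( $ranks, 4 * $i, 4 ) = substr $ranks, 4 * $j, 4;
        substr( $ranks, 4 * $j, 4 ) = $tmp;
    }

    # Hand out the rank of line $k (0-based, in file order) incrementally:
    sub rank_of { unpack 'L', substr $ranks, 4 * $_[0], 4 }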

Now figure out how much usable memory your machine has, and compute the average line length. Use these values to compute n=memory/ave-line-length. This is the number of lines you can fit in memory.
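Continuing in the same script, that could look something like this (just a sketch; the file name, the 512MB budget, the 16MB sample, and the halving safety factor are all placeholder numbers you'd tune):

    my $file       = 'big.txt';            # placeholder input file name
    my $mem_budget = 512 * 1024 * 1024;    # usable memory, say 512MB

    # Estimate the average line length from a sample at the start of the file.
    open my $fh, '<', $file or die "open $file: $!";
    my ( $bytes, $lines ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        $bytes += length $line;
        $lines++;
        last if $bytes >= 16 * 1024 * 1024;    # ~16MB sample
    }
    close $fh;

    my $avg = $bytes / $lines;
    my $n   = int( $mem_budget / $avg / 2 );   # the /2 is a crude safety factor (see below)
    printf "average line ~%d bytes, so n = %d lines per pass\n", $avg, $n;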

Now scan through your file. Store in memory every line whose rank is less than n. When you've gone through the entire file once, you will have the first n lines of the shuffled output in memory. Write them out in rank order. (To make this fast, you'd want to collect them in memory and then write everything out more or less sequentially, rather than line by line. But you'll have to be careful about allocating the right amount of space for sub-batches of lines. You might want to store just the byte offsets of the lines you've buffered and sort those by rank, so the space usage stays perfectly predictable.)

Repeat for the lines ranked n..2n-1. Repeat, repeat, repeat, until you've gone through all N.

It'll take N/n passes, but you'll be doing all sequential I/O.
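Put together, the pass structure might look roughly like this (continuing the sketches above, so $file, $N, $n, and rank_of() come from there; the output file name is made up, and this is untested back-of-the-envelope code, not a finished program):

    my $passes = int( ( $N + $n - 1 ) / $n );          # ceil(N/n)

    open my $out, '>', 'shuffled.txt' or die "open shuffled.txt: $!";
    for my $pass ( 0 .. $passes - 1 ) {
        my ( $lo, $hi ) = ( $pass * $n, ( $pass + 1 ) * $n );

        # Sequential scan: keep only the lines whose rank falls in this window.
        open my $in, '<', $file or die "open $file: $!";
        my ( @buffer, @rank );
        my $k = 0;
        while ( my $line = <$in> ) {
            my $r = rank_of( $k++ );
            if ( $r >= $lo && $r < $hi ) {
                push @buffer, $line;
                push @rank,   $r;
            }
        }
        close $in;

        # Write this batch out in rank order.
        print {$out} $buffer[$_]
            for sort { $rank[$a] <=> $rank[$b] } 0 .. $#rank;
    }
    close $out;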

In practice, you don't want to use exactly n=memory/line-length. You'll need to adjust it for how much memory each line really consumes as a scalar (there's per-scalar overhead), and cut it down by some safety factor so that you don't overflow memory if you're unlucky. (Unless your line lengths have a really strange distribution, though, the sum of the lengths of all the lines in each set of n isn't going to vary all that much.)

Now about that permutation. You don't really need to honor the same permutation for the selection of n lines and for the ordering within those n lines. You can just seed some pseudo-random number generator with the same value for each pass, and then use a smaller permutation of 0..n-1 generated by some other technique for the in-memory reshuffle. I confess that I'm coming up with all of this off the top of my head, but I wonder if it would be random enough for you if within the n lines, you just started at index int(rand(n)), then kept adding some prime number to it and taking that modulo n. As long as your prime doesn't divide n, it's relatively prime to n, so you will walk through all n elements before repeating an index. So you can do this as you read in the n elements, and use it to figure out an index to store the offset of the line directly in memory.
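A self-contained toy check of that walk (separate from the sketches above; the prime and the small n are arbitrary):

    use strict;
    use warnings;

    # Toy check: striding by a prime modulo $n visits every slot exactly once.
    my $n     = 1_000;       # small stand-in for the real n, just for the demo
    my $prime = 104_729;     # any prime that does not divide $n
    my $slot  = int rand $n;
    my %seen;
    for ( 1 .. $n ) {
        $seen{$slot} = 1;
        $slot = ( $slot + $prime ) % $n;
    }
    print scalar( keys %seen ), " distinct slots out of $n\n";   # prints 1000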

So in Perl, you would have one enormous scalar that held all of your lines, and one packed scalar of offsets that held your ordering. Every time you selected a line from the big file, you'd add your prime number to the index, mod it by n, and store the current byte offset at that slot of your ordering vector. Then append the line to the big scalar of lines. After you had all n, loop through the packed ordering vector and write out each line. (Keep the newlines, or you'll have to track the lengths separately!)
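Sketched out, the inner part of each pass could look something like this, replacing the buffer-and-sort write-out in the earlier pass sketch ($in is reopened per pass as before; $out, $file, $N, $n, and rank_of() carry over from above; the prime is arbitrary, and again this is off-the-top-of-my-head code, not something I've run):

    # Inside the per-pass loop, after computing ($lo, $hi):
    my $m = ( $hi <= $N ) ? $n : $N - $lo;   # lines in this pass (the last one may be short)

    my $lines = '';                          # one big scalar holding this pass's lines
    my $order = pack( 'L', 0 ) x $m;         # packed scalar of $m 32-bit byte offsets
    my $prime = 104_729;                     # arbitrary prime; must not divide $m
    my $slot  = int rand $m;

    open my $in, '<', $file or die "open $file: $!";
    my $k = 0;
    while ( my $line = <$in> ) {
        my $r = rank_of( $k++ );
        next unless $r >= $lo && $r < $hi;                          # not this pass's line
        substr( $order, 4 * $slot, 4 ) = pack 'L', length $lines;   # record where it starts
        $lines .= $line;                                            # keep the newline!
        $slot = ( $slot + $prime ) % $m;
    }
    close $in;

    # Walk the packed ordering vector and write each line out.
    for my $i ( 0 .. $m - 1 ) {
        my $off = unpack 'L', substr $order, 4 * $i, 4;
        my $end = index $lines, "\n", $off;
        $end = length( $lines ) - 1 if $end < 0;                    # final line with no newline
        print {$out} substr $lines, $off, $end - $off + 1;
    }

The 32-bit offsets are fine here because each in-memory batch has to fit in RAM anyway, so it stays well under 4GB by construction.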

I'm sure there's some way to preserve the simplicity but get better randomization, but my inspiration has dried up for tonight.

Good luck!

Re^2: Randomizing Big Files
by Aristotle (Chancellor) on Jan 27, 2005 at 18:01 UTC

    You made me curious enough to do the math. Indeed, two seeks per line (one to read, one to write) at 10 ms each add up to 56 hours(!) for a set of 10,000,000 lines: 10,000,000 lines × 2 seeks × 10 ms ≈ 200,000 s ≈ 56 hours. In a 4GB file, 10,000,000 lines would mean an average line length of about 430 bytes, which is on the long side; with more typical lines of 50-100 bytes the file holds 40-80 million lines, so it's probably safe to assume that the OP's actual problem would require 4-8× as much time. Waiting a week or more for the randomization of a 4GB file doesn't sound very appealing…

    Makeshifts last the longest.