in reply to Re: Randomizing Big Files
in thread Randomizing Big Files

For a 4 GB file I need an array that uses 32 bits to store each position, and 4 bytes * n_lines will use too much memory! I have roughly 150,000,000 lines, so that is at least 572 MB just to load the positions. And then how do I randomize 150,000,000 entries, since that is about as much work as sorting them all?

But what we forget is that an array in Perl is an array of SCALARs, so we will use much more memory for each position than just 4 bytes. And randomizing that array in Perl will copy it in memory, use even more memory, and be very slow!
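
For a sense of that overhead, here is a quick measurement using the CPAN module Devel::Size (assuming it is installed); the exact numbers depend on your perl build:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);   # CPAN module, assumed installed

    # Compare a normal array of integers with a packed string holding the
    # same values at 4 bytes each (32-bit big-endian 'N' format).
    my @positions = (1 .. 1_000_000);
    my $packed    = pack 'N*', @positions;

    printf "array:  %d bytes (~%.0f per element)\n",
        total_size(\@positions), total_size(\@positions) / @positions;
    printf "string: %d bytes\n", length $packed;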

Re^3: Randomizing Big Files
by Anonymous Monk on Jan 28, 2005 at 22:34 UTC
    First of all, you're right: an index is probably a bad way to solve your problem. It's still possible (for example, you could generate the index on disk rather than in memory if you needed to), but other monks have already posted other ideas that you might want to try first.

    But what we forget is that an array in Perl is an array of SCALARs, so we will use much more memory for each position than just 4 bytes. And randomizing that array in Perl will copy it in memory, use even more memory, and be very slow!

    This is off topic, but there's a more efficient way to handle very large arrays of numbers in Perl. You don't have to use a normal array: you can pack the values into a single string of bytes (using the pack() function), and read or write individual elements with the vec() and unpack() functions.
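
    A rough sketch of the idea (the 32-bit width and the variable names here are just for illustration):

        use strict;
        use warnings;

        # One string holds every 32-bit position: roughly 4 bytes per entry
        # instead of a full Perl scalar per entry.
        my $index = '';

        # vec() treats the string as a vector of fixed-width unsigned ints.
        vec($index, 0, 32) = 1_234;        # write element 0
        vec($index, 1, 32) = 987_654;      # write element 1

        my $pos = vec($index, 1, 32);      # read element 1 back
        print "element 1 = $pos\n";

        # pack()/unpack() convert between a normal list and the same kind
        # of packed string, 4 bytes per value with the 'N' format.
        my $packed = pack 'N*', 1_234, 987_654;
        my @list   = unpack 'N*', $packed;

    At 4 bytes per entry, 150,000,000 positions come to the same ~572 MB as above, but without the per-scalar overhead of a real array.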

    Also, randomizing the array doesn't require making a copy of it. Instead, you can shuffle the array "in place" by looping over it and swapping each element with a randomly chosen element from the part you haven't reached yet. This technique is known as the "Fisher-Yates shuffle".
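
    Something along these lines (just a sketch; @positions is a stand-in for whatever array of line positions you build):

        use strict;
        use warnings;

        # In-place Fisher-Yates shuffle: walk backwards through the array
        # and swap each element with a random element at or before it.
        sub fisher_yates_shuffle {
            my ($array) = @_;
            for (my $i = $#$array; $i > 0; $i--) {
                my $j = int rand($i + 1);                # 0 <= $j <= $i
                @$array[ $i, $j ] = @$array[ $j, $i ];   # swap, no copy made
            }
        }

        my @positions = (0 .. 9);
        fisher_yates_shuffle(\@positions);
        print "@positions\n";

    The core List::Util module also provides a shuffle() function that does the same job.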

    -- AC