in reply to Re^7: Strategy for randomizing large files via sysseek
in thread Strategy for randomizing large files via sysseek
Okay. You've avoided the major slowdown by not using Tie::File for both reading and writing--but there are still problems with your approach.
You probably think that by using an array of indices rather than an array of lines, you are saving large amounts of memory. This is not the case.
A Perl array uses at least 24 bytes of memory for every element before you actually store anything in those elements. So, your array of indexes saves very little memory compared to storing the lines of the file.
In the OP's case of at least 20 million lines, that is at least 20 million indexes x 24 bytes = 480 MB.
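If you want to check that overhead for yourself, a quick sketch along these lines will show it (it uses the CPAN module Devel::Size, which is not part of the code under discussion):

use strict;
use warnings;
use Devel::Size qw(total_size);

my @indexes = (0 .. 999_999);            # one million integer indexes
my $bytes   = total_size( \@indexes );   # deep size of the array, in bytes
printf "%d bytes total, %.1f bytes per element\n", $bytes, $bytes / @indexes;

The exact figure depends on your perl build, but on a 64-bit perl it comes out well past the 24-bytes-per-element floor mentioned above.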
I'd be interested in testing what I can come up with for your data file. If you want to post it somewhere, let me know, and I'll see if I can beat your 4 hour mark.
The data file I used for the 1 billion lines test was a simple extension of that used above:
my $n=0; printf "%030d\n", $n++ while $n < 1E9;
To test your code, just substitute that line into your program above; but be prepared for a looong wait :)
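(If you would rather materialise that data as an actual file on disk, redirecting the one-liner from a Unix-ish shell does the job; the file name here is just illustrative:

perl -e 'my $n=0; printf "%030d\n", $n++ while $n < 1E9;' > bigfile.dat

On Windows you would need to adjust the quoting, but the idea is the same.)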
The first long wait will come when the line:
my @indexList = (0..scalar @lines);
is executed. In order for Tie::File to determine the size of the array (scalar @lines), it will have to read every line in the file. Whilst wc -l will do this for a 32 GB file in, say, 10 minutes, it will take Tie::File a great deal longer. This is because, as well as reading every line of the file sequentially and discarding it--as wc -l would--it also builds a hash of the offset of the start of each line.
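To make that concrete, here is a minimal sketch of the pattern in question (the file name is illustrative); the tie itself costs almost nothing, but the size request triggers the full scan:

use strict;
use warnings;
use Tie::File;

# Tying the array is cheap; no data has been read at this point.
tie my @lines, 'Tie::File', 'bigfile.dat'
    or die "Cannot tie bigfile.dat: $!";

# Asking for the size forces Tie::File to read the entire file,
# recording the byte offset of the start of every line as it goes.
my $lineCount = scalar @lines;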
Each element of that hash costs at least 21 bytes per line. For 1 billion lines, that alone amounts to 21 bytes x 1E9 = at least 19.5 GB!