in reply to Randomizing Big Files

How about just randomizing the line numbers as an array? First, run open(FH, "wc -l filename |") and grab the line count. Then read recipes 8.6, 8.7, and 8.8 in the Perl Cookbook. Instead of reading all the lines into an array, fill an array with the line indices and shuffle it (recipe 4.17 in the Perl Cookbook). Then read forward to each line number in turn, rewinding in between. You only need to open the old file once: just reset its pointer using $. = 0 and loop through the array of line numbers.
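A minimal sketch of that approach, under some assumptions: the input name bigfile.txt is hypothetical, List::Util's shuffle stands in for the Cookbook's Fisher-Yates shuffle, and (per the correction in the reply below) the rewind uses seek rather than $. = 0. Re-scanning from the start for every line makes this quadratic in the number of lines, but it never holds more than one line plus the index array in memory.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my $file = 'bigfile.txt';                 # hypothetical input file

    # Grab the line count from wc -l without reading the file into memory.
    open my $wc, '-|', "wc -l $file" or die "wc: $!";
    my ($count) = <$wc> =~ /(\d+)/;
    close $wc;

    # Shuffle the line numbers instead of the lines themselves.
    my @order = shuffle(0 .. $count - 1);

    open my $in,  '<', $file            or die "$file: $!";
    open my $out, '>', "$file.shuffled" or die "$file.shuffled: $!";

    for my $target (@order) {
        seek $in, 0, 0;                       # rewind the file pointer
        my $line;
        $line = <$in> for 0 .. $target;       # skip forward to the chosen line
        print {$out} $line;
    }

    close $in;
    close $out;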

Re^2: Randomizing Big Files
by Aristotle (Chancellor) on Jan 26, 2005 at 15:23 UTC

    reset its pointer using $. = 0

    No, that won't do what you're after, it resets the line counter but doesn't affect the file pointer. You want seek FH, 0, 0 instead.
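    For illustration only (the file name is made up), a tiny sketch of the difference:

        open FH, '<', 'bigfile.txt' or die $!;   # hypothetical file name
        1 while <FH>;                  # read to the end; $. now holds the line count
        $. = 0;                        # resets only the line-number counter...
        print "still at EOF\n" if eof FH;   # ...the read position has not moved
        seek FH, 0, 0;                 # this is what actually rewinds the handle
        print scalar <FH>;             # reads the first line again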

    Makeshifts last the longest.

Re^2: Randomizing Big Files
by Anonymous Monk on Jan 26, 2005 at 15:45 UTC
    Just sorting the line numbers won't work, since my lines don't have a fixed size! So when I create the new file I will always have to search for each line, which will be very slow.

    It could work if we create a supporting algorithm to find the lines, using some known lines as starting points from which to search for the others, but we can't forget that it will still be slow since the file is so big!

      Another question: How many times do you need to do this? Why is efficiency so important? If you only need to do it once, just code it, run it, and be done. :D
        If you only need to do it once, just code it, run it, and be done.

        We need something efficient, because with normal code it is just impossible to run on a 4 GB file! I don't even have 1 GB of RAM to be able to load everything into memory.

      Thanks to Aristotle for the seek correction!

      New suggestion: Okay, read through the file character by character, building an array of the positions of the '\n's. Shuffle that, then read the text between the '\n's into the new file.
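      A rough sketch of that suggestion, assuming a hypothetical bigfile.txt; it records line-start byte offsets with tell rather than literally scanning character by character, so only one integer per line is held in memory:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use List::Util qw(shuffle);

          my $file = 'bigfile.txt';                 # hypothetical input file

          open my $in,  '<', $file            or die "$file: $!";
          open my $out, '>', "$file.shuffled" or die "$file.shuffled: $!";

          # One sequential pass: remember where each line starts.
          my @starts;
          while (1) {
              my $pos = tell $in;
              defined(my $line = <$in>) or last;    # read and discard to advance
              push @starts, $pos;
          }

          # Shuffle the offsets, then jump straight to each line and copy it out.
          for my $pos (shuffle @starts) {
              seek $in, $pos, 0 or die "seek: $!";
              print {$out} scalar <$in>;
          }

          close $in;
          close $out;

      This avoids both slurping the file and the repeated rescans of the earlier sketch: each line is read exactly twice, once to find its offset and once to copy it.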