Re: Strategy for randomizing large files via sysseek (heap unsort)
by BrowserUk (Patriarch) on Sep 09, 2004 at 15:17 UTC
One effective method of sorting huge files is a heap sort: read the lines one at a time and write them out to a number of files according to which subset of the sorted order they belong to. E.g. all the records whose sort field starts with 'A' go to the A.tmp file, 'B' to B.tmp, etc. You then sort those smaller files and recombine them in the proper order at the end.
You can use a similar strategy to randomise your dataset.
Open 100 files for output. Process all your files one file, one line at a time, picking an output file at random to write each record to. Rewind all the temp files, shuffle the handles, and merge them back to a final file in random order. This is not a fair randomisation: the first record in the final file will definitely be one of the records from the first of the original files read.
However, you can then run the program as a second pass supplying the output from the first run. After this second pass, the randomisation will be fair.
As you are only ever processing any given file in sequential order, the processing speed should be just about as quick as it can be.
#! perl -sw
use strict;
use List::Util qw[ shuffle ];

## Omit the glob if your shell expands wildcards for you.
BEGIN{ @ARGV = shuffle map glob, @ARGV }

## Open the temp files read-write so they can be read back after the seek.
my @temps;
open $temps[ $_ ], '+>', "tmp/$_.tmp" or die $! for 0 .. 99;

## Scatter the input records across the temp files at random.
while( <> ) {
    print { $temps[ rand 100 ] } $_;
}

## Rewind the temp files, shuffle the handles, and merge.
seek $temps[ $_ ], 0, 0 for 0 .. 99;

open FINAL, '>', "randomised.all" or die $!;
for my $in ( shuffle @temps ) {
    while( <$in> ) {
        print FINAL;
    }
    close $in;
}
close FINAL;
*Untested.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Strategy for randomizing large files via sysseek
by Random_Walk (Prior) on Sep 09, 2004 at 14:08 UTC
Random thoughts from Random_Walk
Do you have to
File1 --> Randomise --> File1'
File2 --> Randomise --> File2'
.
.
Filen --> Randomise --> Filen'
or is it more a case of
File1 --+              +-- Filea
File2 --+-> Randomise -+-- Fileb
 ....   |              |    ....
Filen --+              +-- Filez
How fast can the script that uses the randomised data actually consume it? Do you really have to rewrite the entire files? Perhaps you can provide a script that returns random lines from them to STDOUT (and keeps a record so it does not give the same line twice).
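For illustration, a minimal sketch of that first suggestion, assuming fixed-length records (as the sysseek approach in the title implies) and a made-up file name and record length; the hash of used record numbers is the part that grows with the data:
use strict;
use warnings;

my $file   = 'data.rec';   # illustrative name
my $RECLEN = 80;           # assumed bytes per record, newline included

my $records = ( -s $file ) / $RECLEN;
open my $fh, '<', $file or die "$file: $!";

my %used;
while ( keys %used < $records ) {
    # Draw a record number that has not been emitted yet
    # (re-draws get slower as more records are used up).
    my $n;
    do { $n = int rand $records } while $used{$n}++;

    sysseek $fh, $n * $RECLEN, 0 or die $!;
    sysread $fh, my $rec, $RECLEN;
    print $rec;            # emit the record on STDOUT
}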
I guess your increasing mem usage comes from the record of which lines you already output. You could return some lines (perhaps 10%), then re-write the source files without the used up lines, rinse repeat...
You could clobber the start of the lines you have used with some unique token, XXX-Clobbered-XXX, so you do not need to store the used-lines record in RAM. Of course, as you get past 50% your random pick will hit a clobbered line more often than not, so a rewrite of the source data purging the used lines would still be required at some point.
You could put in more RAM or faster disk :)
Cheers, R.
Original poster here.
First reply: I can't imagine Tie::File can be any faster than using sysseek. That's pretty low-level already.
Second reply:
It's the second case: all files are randomized to produce many other files. I've even tried combining all input files into a big one to make handling things easier. Having 50 filehandles open on (relatively) smaller files seems to be a little faster however.
This is going to be imported elsewhere, so I cannot use the STDOUT trick.
I guess your increasing mem usage comes from the record of which lines you already output. You could return some lines (perhaps 10%), then re-write the source files without the used up lines, rinse repeat...
Rewriting the source files could get pretty ugly, but that might be a possible solution. I think the memory must be leaking somewhere though, as even a giant hash should not be taking up that much memory. I hope.
You could clobber the start of the lines you have used with some unique token, XXX-Clobbered-XXX, so you do not need to store the used-lines record in RAM. Of course, as you get past 50% your random pick will hit a clobbered line more often than not, so a rewrite of the source data purging the used lines would still be required at some point.
I can't rewrite the original files - it would lose the fixed-record-ness. It would also mean that instead of a hash lookup, I would have to sysseek, sysread, check if it has XXX, and try again. The bottleneck is really speed at this point, the memory drain is just worrisome.
You could put in more RAM or faster disk :)
Faster disk, perhaps. I thought about splitting the files onto different physical disks to make sysseek faster, but I don't have enough disks to really make a difference. These are big files: I think I did a rough calculation that I would need well over 4 Gigs of RAM to store all the information in memory.
How random do they have to be? If you make $linesize * n == $blocksize, then you can read a random block of n lines from each of the 50 file handles, randomise those lines in memory, and write them out. You only have to mark entire blocks as used, and the disk reads will be n times more efficient. Of course, lines that were within n of each other will stand a good chance of ending up within n*25 of each other after the randomisation. Perhaps it will be fast enough that you can run two passes, and this will be good enough for engineering.
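A rough sketch of this block-at-a-time idea, again assuming fixed-length records; the record length, block size and file glob are made up for illustration, and rather than marking blocks as used it simply visits every block once in a shuffled order:
use strict;
use warnings;
use List::Util qw( shuffle );

my $RECLEN = 80;      # assumed bytes per record
my $N      = 1000;    # records read per seek

# Enumerate every ( file, byte offset ) block across the input files.
my @blocks;
for my $file ( glob 'data/*.rec' ) {
    my $records = ( -s $file ) / $RECLEN;
    push @blocks, [ $file, $_ * $N * $RECLEN ]
        for 0 .. int( ( $records - 1 ) / $N );
}

# Visit the blocks in random order; only whole blocks are tracked,
# so the bookkeeping is 1/N the size of per-line bookkeeping.
for my $blk ( shuffle @blocks ) {
    my ( $file, $offset ) = @$blk;
    open my $fh, '<', $file or die "$file: $!";
    sysseek $fh, $offset, 0 or die $!;
    sysread $fh, my $buf, $N * $RECLEN;

    # Shuffle the records within the block and emit them.
    print shuffle( $buf =~ /.{$RECLEN}/sg );
}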
Cheers, R.
I think you missed my point on why Tie::File would be good for you. I see you found your solution already, so this is after the fact, but here you go anyway.
Tie::File allows you to take a large file and treat it as an array without pulling the entire file into memory. If I understand you correctly, you need to make new files from the randomized lines of some really huge file, without messing up the order of the original huge file. What could be easier than an array for this? As soon as you attach to your large file with Tie::File you instantly know the total number of lines. It is trivial to have perl return you a random number between 0 and the total number of lines you have. Then simply write that line to your new file, and keep track of the line number (array element) because you don't want to use it again.
Now the trick is in optimizing your random number generator function to return a "unique" random number each time, because you don't want to sit there waiting for a line you haven't used yet, especially as over time the number of used lines outnumbers the unused ones. My suggestion is that you create yourself a "Random Order List". Since you know the total number of lines, you effectively have a sequence of numbers. All you really want to do is jumble that sequence up and then write out the now-random list "in order"; this is typically called a "shuffle". Let's look at some code:
# @lines is the old unshuffled array (list of indexes into your Tie::File array)
@lines = 0 .. 51;
# for you the sequence would be 0 .. $#YourTieArray
@shuffled = ();     # an empty array
$numlines = 52;     # scalar @YourTieArray again
while ($numlines > 0) {
    push(@shuffled, splice(@lines, int rand $numlines, 1));
    $numlines--;
}
# @shuffled contains the shuffled line numbers, @lines is empty
Now, my thinking says that shuffling a list of integers, even if that list is big, will take less memory and less time than having every line reprinted with a random token in front of it, because once you have your shuffled list of array indexes all you have to do is loop through it and print the corresponding line from your Tie::File array.
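A minimal sketch of that last step, with illustrative file names (huge_input.txt, randomised.txt); for brevity it uses List::Util's shuffle in place of the hand-rolled splice loop above:
use strict;
use warnings;
use Fcntl qw( O_RDONLY );
use Tie::File;
use List::Util qw( shuffle );

# Treat the huge file as a read-only array of lines.
tie my @lines, 'Tie::File', 'huge_input.txt', mode => O_RDONLY
    or die "Cannot tie file: $!";

# Shuffle the list of line numbers, not the lines themselves.
my @order = shuffle 0 .. $#lines;

open my $out, '>', 'randomised.txt' or die $!;
print {$out} $lines[$_], "\n" for @order;   # Tie::File autochomps the stored lines
close $out;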
A discussion of shuffling algorithms that contributed to my comment can be found here: http://c2.com/cgi/wiki?LinearShuffle
If you read this, let me know whether it made a time improvement or not.
Ketema
Re: Strategy for randomizing large files via sysseek
by RiotTown (Scribe) on Sep 09, 2004 at 14:28 UTC
It may be overkill, but you can use a database for this functionality.
Both mysql and Oracle have efficient ways of loading large datasets into a pre-defined table while obeying constraints on those tables. This may not be the fastest solution, but it should ease up on the memory requirements.
MySQL LOAD DATA
Oracle SQL*Loader
And how would these databases return all the rows in the table randomly?
Random selection back out of the DB shouldn't be that hard if you build the table correctly.
Use an ID field that gives the order of insertion (i.e. the first row inserted gets "0", the next gets "1", etc.). Then populate an array with those values and randomize it using established methods. Your array might now look like:
50, 30, 98, 3, 6, 127, 42 ...
You can then do something like:
# @rand_array contains the randomized array
$sth = $dbh->prepare("SELECT data FROM my_table WHERE ID=?");
foreach $id (@rand_array) {
    $sth->execute($id);
    print $OUTFILE join('', $sth->fetchrow_array);  # there will only be one element, no worries
}
As long as you keep track of which IDs you create during your INSERT session, this will work smashingly. I also recommend issuing a CREATE TABLE at the start and a DROP TABLE at the end to ensure that you aren't duplicating once the file is written.
--
$me = rand($hacker{perl});
How random do you need them to be?
Oracle will return the rows in physical order (which may match input order if you haven't done a lot of page splits during the load) but not in sorted order unless you specifically say "Order By" in the query.
How important is the randomness? You could approach some level of randomness by repeatedly catting the files together and then splitting them at different points and catting them together again... kind of like shuffling cards...
Are the initial files random, or are they already in some sorted order? If they are already random, just concatenate them into one giant load file. Once loaded, you can pull them out based on rownum so they will come out in the exact order they went into the table. Not truly random, but as random as the initial files.
If the initial files are ordered, the above solution doesn't work. For a MySQL-based table, you could do something like
SELECT * FROM BLA ORDER BY RAND()
although I'm not sure of exactly how random the results will be.
Using MySQL you could do this:
SELECT * FROM table ORDER BY RAND() LIMIT 1;
to select one record at a time, or leave out the LIMIT to select all records.
Re: Strategy for randomizing large files via sysseek
by bluto (Curate) on Sep 09, 2004 at 15:42 UTC
The following low-tech solution may or may not work for you -- it trades millions of seeks for several sequential rewrites of the file (similar to BrowserUk's solution) and doesn't require much code at all. Rewrite the files with a random number prepended or appended to each line. Run the new files through GNU sort, but be sure to use the -T flag to send temp files to a place where you have some free disk, preferably on a different physical disk (enough for about 1.5 - 2x the size of the largest file). You could send the output through a pipe where a process then removes the random number from each line. NOTE: I recall being able to sort text files close to a GB with GNU sort in reasonable time, but I'm not sure how it handles files much larger than this.
The main benefit of course is that there is less code to test/maintain, hence the "low tech" caveat. If it's fast enough for the OP, then it's worth considering. I'm sure the OP can figure this out easily by testing.
On a lesser note, I am not convinced that your solution will fairly shuffle the lines, even after a second pass -- though if the number of temp files and/or passes is scaled with the total number of lines, this may be sufficient for the OP's needs. (I.e. tempfiles^passes probably needs to equal or exceed the total number of lines to sort, but that's a WAG.)
Well, your solution may be linear, but it doesn't remove duplicates, which is part of the requirement.
Re: Strategy for randomizing large files via sysseek
by ketema (Scribe) on Sep 09, 2004 at 13:35 UTC
Have you tried Tie::File? It's a great module that treats a file like an array, and DOES NOT pull the entire file into memory. Great for working with LARGE files like you are describing.
Re: Strategy for randomizing large files via sysseek
by johnnywang (Priest) on Sep 09, 2004 at 18:00 UTC
I had a situation where a huge file needed to be sorted, and I used the heap sort as described by BrowserUk above. One difference is that I used a hash function to compute which file a line should go to, which gives a more balanced distribution into the smaller files.
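A minimal sketch of that bucketing step, assuming (purely for illustration) that the sort key is the first whitespace-separated field, 100 buckets, and Perl's unpack checksum as the hash function:
use strict;
use warnings;

my $BUCKETS = 100;

# One temp file per bucket.
my @fh;
open $fh[$_], '>', "tmp/bucket_$_.tmp" or die $! for 0 .. $BUCKETS - 1;

while ( my $line = <> ) {
    # Hash the key field to a bucket; unpack's "%32" template
    # yields a 32-bit checksum of the key's bytes.
    my ($key) = split ' ', $line;
    $key = '' unless defined $key;
    my $bucket = unpack( '%32C*', $key ) % $BUCKETS;
    print { $fh[$bucket] } $line;
}
close $_ for @fh;
Note that with a hash partition the buckets no longer correspond to ranges of the sort order, so for the sorting case the final step becomes a merge of the individually sorted buckets rather than a straight concatenation.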
I also ran into the memory problem, and I've never really figured out why my script was using that much memory. What I did was to call the script itself using backtick/system() so that the memory will be freed up after each smaller file is processed.
Depending on how random you need the result to be, you can use a similar approach: split one large file into smaller files randomly, then randomly combine them into another large file, and run this a few times. Use backtick/system() to free up the memory between runs.
Re: Strategy for randomizing large files via sysseek
by TilRMan (Friar) on Sep 10, 2004 at 03:39 UTC
Here's my divide-and-conquer approach: Split incoming lines randomly among 50 tempfiles. Recurse on the tempfiles. When the tempfiles are smaller than 1000 lines, shuffle in memory. Spit out the shuffled tempfiles sequentially.
Presumably, adjusting the knobs (higher) to match your system will improve performance.
I believe that this is a fair shuffle (assuming fairness of shuffle() and rand), uses constant memory, runs in O(n log n) time, and uses "O(n)" disk. (I may be wrong about these -- comments welcome.) The big advantage is that there is no random seeking, so you get the benefit of caches/buffers.
#!/usr/bin/perl
use strict;
use warnings;

# Shuffle unlimited number of lines from ARGV to STDOUT
# using tempfiles

use File::Temp qw( tempfile );
use IO::File;
use List::Util qw( shuffle );

my $SPREAD    = 50;
my $TERMINATE = 1000;

sub shuffle_lines {
    my ( $infh, $outfh ) = @_;

    # Scatter the input at random across $SPREAD tempfiles.
    my @temp  = map { scalar tempfile } 1 .. $SPREAD;
    my @count = (0) x $SPREAD;
    while (<$infh>) {
        my $i = int rand $SPREAD;
        $temp[$i]->print($_);
        $count[$i]++;
    }

    # Small tempfiles are shuffled in memory; big ones are recursed on.
    for my $i ( 0 .. $#count ) {
        my $fh = $temp[$i];
        seek $fh, 0, 0;
        if ( $count[$i] <= $TERMINATE ) {
            print $outfh shuffle <$fh>;
        }
        else {
            shuffle_lines( $fh, $outfh );
        }
    }
}

shuffle_lines( \*ARGV, \*STDOUT );
Re: Strategy for randomizing large files via sysseek
by mhi (Friar) on Sep 09, 2004 at 18:41 UTC
Okay, since chunking is not allowed and you're noticing that working on small files is faster than big ones, I propose the following:
- Figure out how many lines you're going to have, or just define a $maxlines that's definitely going to be bigger, but not by too many orders of magnitude.
- Go through your original data sequentially.
- Assign each line encountered an unused random number less than your $maxlines.
- Use the first part of this number as a file index (to determine the target file).
- Sequentially write the second part of the number as a line-index within the file, followed by the data line, into the target file.
- Add the random number to your used-up number list.
- Individually sort each of the target files by the line-index.
- Remove the line-indexes from your target files.
- If the target files need to be different sizes, just go through them sequentially and create new ones with the lengths of your choice from them.
This way you'll do all the reading and writing sequentially, except for when you're sorting the target files. You can fine-tune their size by tweaking the size of the file-index as compared to that of the line-index. You might even want to keep your target files short enough, so each of them can be read into memory in one pass and sorted there, thereby virtually eliminating non-sequential disk access.
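A rough sketch of the write phase of this scheme, assuming (purely for illustration) 100 target files, a $maxlines of 10,000,000 and a hash to track the used-up numbers:
use strict;
use warnings;

my $MAXLINES = 10_000_000;   # comfortably more than the real line count
my $FILES    = 100;          # number of target files
my $PER_FILE = $MAXLINES / $FILES;

my ( %used, @out );
open $out[$_], '>', "target_$_.tmp" or die $! for 0 .. $FILES - 1;

while ( my $line = <> ) {
    # Draw an unused random number below $MAXLINES.
    my $n;
    do { $n = int rand $MAXLINES } while $used{$n}++;

    # The first part of the number picks the file,
    # the second part is the line-index within that file.
    my $file  = int( $n / $PER_FILE );
    my $index = $n % $PER_FILE;
    print { $out[$file] } "$index\t$line";
}
close $_ for @out;

# Afterwards: sort each target_*.tmp numerically on its leading
# column and strip it, e.g. with  sort -n | cut -f2-  per file.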
Have fun!
Update: Actually, I now realize this isn't unlike bluto's solution, except that it's much more explicit about not rewriting each file for itself.
Update: TilRMan's solution made me add the sentence about sorting in memory.
Re: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 09, 2004 at 15:59 UTC
sort -u input.txt |\
perl -pe '$r = substr(rand(), 2); $_ = "$r\t$_"' |\
sort -n | cut -f 2- > output.txt
sort is optimized to work well with really large files. You may want to use the -S and -T parameters on sort.
Works well, thank you much!
Re: Strategy for randomizing large files via sysseek
by duff (Parson) on Sep 09, 2004 at 15:21 UTC
Can you show us your code? You're asking for optimization help, but are being vague about specifics. Optimization is all about the details :)
Re: Strategy for randomizing large files via sysseek
by traveler (Parson) on Sep 09, 2004 at 22:38 UTC
I'm not sure this would be random enough but it might be.
- Open n output files (of1..ofn).
- Open the n input files (if1..ifn).
- Read a line from a random input file and write it to a random output file.
If you had to, you could shuffle the output files later, but that would end up seeking somewhere along the line.
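A minimal sketch of that loop, with illustrative input/output file names, dropping each input handle once it is exhausted:
use strict;
use warnings;

my @in;
for my $file ( glob 'input/*.dat' ) {
    open my $fh, '<', $file or die "$file: $!";
    push @in, $fh;
}

my @out;
open $out[$_], '>', "output/$_.dat" or die $! for 0 .. $#in;

# Read from a random input, write to a random output,
# until every input has been drained.
while (@in) {
    my $i    = int rand @in;
    my $line = readline $in[$i];
    if ( defined $line ) {
        print { $out[ int rand @out ] } $line;
    }
    else {
        my $done = splice @in, $i, 1;   # this input is exhausted
        close $done;
    }
}
close $_ for @out;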
--traveler
Re: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 10, 2004 at 15:36 UTC
Original poster here.
All is well. Used a combination of shell and perl to get the job done. Here's a rundown of the final solution:
1. Open a new file for writing on another filesystem. Go through each input file and rewrite it to the new file using printf and int rand(1_000_000) to make lines with a random number at the start of each one.
2. Use the *nix sort command to sort this file to a new one, writing to a different filesystem, and using the -T flag to put the temporary stuff on another filesystem. This took 45 minutes, but worked perfectly.
(I tried combining 3 and 4 by having perl parse the large file and write to separate files, but it ran out of memory.)
3. Split the large sorted file up using "split -l 250000 sortedfile myfiles_".
4. Have a perlscript open each file, remove the number at the start of each line, and write to the final finished files.
This worked fine - no memory problems, little I/O impact, and finished in a reasonable amount of time (< 4 hours). Thanks everyone!
Re: Strategy for randomizing large files via sysseek
by mkirank (Chaplain) on Sep 10, 2004 at 13:37 UTC
I was having a similar problem: as the file size goes up there is too much thrashing. The best thing is to use a database. Try DBD::SQLite2 -- its insertion is fast, and if you want you can even sort the data.