Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I have a challenge.

How can I randomize the lines of a very big file using just pure Perl?

I have here a 4GB file, and I need to randomize its lines so that I can distribute them over many processes without creating a pattern in the order of the entries, giving a balanced distribution of the data that will be used for searches. If every process gets random entries, the system makes more balanced searches of the information, and this also won't overload the main DB grid where the search is done, since each type of search is distributed over different servers in the grid.

Randomizing a big file is a big problem, since I need to work with all the lines in the file and I can't use a solution that randomizes by small blocks: a line needs to be able to end up anywhere in the whole range of the file, not just inside its block. I also can't load the whole file into memory, not even as an array, or send it to a DB, since a DB would make the process far too slow.

So, how can I do that with pure Perl, fast and with little memory?!

By gmpassos.

Replies are listed 'Best First'.
Re: Randomizing Big Files
by BrowserUk (Patriarch) on Jan 26, 2005 at 13:13 UTC

    You may find the thread at Strategy for randomizing large files via sysseek useful.

    One thing that is not clear from your post is whether you want to actually shuffle the lines in your 4GB file, or just find a mechanism for giving each of many processes a unique random subset of the lines from that file.

    It may be that they amount to the same thing in that shuffling the large file is possibly the easiest way, but it may be possible to allow each process to select lines at random without overlaps more efficiently.

    Also, if you would accept a "nearly random" distribution as opposed to statistically valid randomness, that may allow a more efficient, slightly-less-than-truly-random solution.


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: Randomizing Big Files
by CountZero (Bishop) on Jan 26, 2005 at 13:03 UTC
    Look at it from another side: would it work for your purpose if you randomize the distribution of the data (on a line by line basis? This is not clear in your question) in the file over the different processes, rather than randomizing the data in the file and then handing the randomized data sequentially to the various processes?

    Perhaps we can be of more help if you describe what exactly is the set-up and what needs to be done.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      If you split the big file into small files and then randomize each small file, you don't actually get a full randomization!

      It's like splitting a dictionary by first letter into groups from A to Z and randomizing each group: the A... words will always be at the beginning of the data. What I want is to distribute all the data over the whole file, so that a process reading the list from the top is always reading a randomized sequence with the same probability of getting any line of the full file.

        All~

        Actually you can make this work. You use the same technique as a split-and-merge sort: after you have randomized each subgroup, you collect them back into one big random group by randomly selecting which group to take the next line from, until they are all empty.

        This might even be a good way to do something like this, as split-and-merge is the technique employed by databases to sort things that won't fit in memory.

        First break the file into memory-sized chunks and use the Fisher-Yates shuffle to randomize each of them. Then take the randomized runs and shuffle them together by randomly selecting which one to take the next line from. Repeat the process until you have one very large random run. I know it will provide an even distribution if you combine all of the random runs simultaneously; I have yet to convince myself whether or not it will still provide even distributions if your shuffle only takes two random runs at a time.

        UPDATE: I have convinced myself that you can do it by repeatedly shuffling together two runs, as long as you select the runs to shuffle together randomly every time.
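
        A minimal sketch of that merge step, assuming the chunk files tmp.0, tmp.1, ... (made-up names) have each already been Fisher-Yates shuffled. Picking the next run with probability proportional to the lines it still has left (a slight refinement of picking uniformly among the runs) keeps the combined order uniformly random:

        #!/usr/bin/perl
        # Sketch: merge already-shuffled runs into one uniformly shuffled stream.
        use strict;
        use warnings;

        my @names = sort glob 'tmp.*';      # the shuffled chunk files (hypothetical)
        my (@fh, @left);
        my $total = 0;
        for my $name (@names) {
            open my $in, '<', $name or die "open $name: $!";
            my $n = 0;
            $n++ while <$in>;               # one extra pass to count the lines
            seek $in, 0, 0;                 # rewind for the merge
            push @fh,   $in;
            push @left, $n;
            $total += $n;
        }

        while ($total) {
            # pick a run, weighted by how many lines it has left
            my $r = int rand $total;
            my $i = 0;
            while ($r >= $left[$i]) { $r -= $left[$i]; $i++ }
            my $fh = $fh[$i];
            print scalar <$fh>;
            $left[$i]--;
            $total--;
        }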

        Boots
        ---
        Computer science is merely the post-Turing decline of formal systems theory.
        --???
Re: Randomizing Big Files
by samizdat (Vicar) on Jan 26, 2005 at 13:27 UTC
    How about just randomizing the line numbers as an array? First, run open(FH, "wc -l filename |") and grab the line count. Then read 8.6, 8.7, 8.8 from the Cookbook. Instead of reading all the lines into an array, fill the array with the line numbers and shuffle it (4.17 in the Perl Cookbook). Then read up to each line number in turn and reset. You only need to open the old file once, just reset its pointer using $. = 0 and loop through the array of line numbers.

      reset its pointer using $. = 0

      No, that won't do what you're after; it resets the line counter but doesn't affect the file pointer. You want seek FH, 0, 0 instead.

      Makeshifts last the longest.

      Just shuffling the line numbers won't work, since my lines don't have a fixed size! So when I create the new file I will always have to search for each line, which will be very slow.

      It can work if we create a support algorithm to find the lines, using some known lines as starting points to search for the others, but we can't forget that it will still be slow since the file is too big!

        Another question: How many times do you need to do this? Why is efficiency so important? If you only need to do it once, just code it, run it, and be done. :D
        Thanks to Aristotle for the seek correction!

        New suggestion: Okay, read by character, building an array of the positions of the '\n's. Shuffle that, then read between the '\n's into a new file.
Re: Randomizing Big Files
by Aristotle (Chancellor) on Jan 26, 2005 at 15:20 UTC

    This should be about as efficient as you will get. It works in two passes: find start-of-line offsets on the first, then a long loop of seek-and-read directly to the respective offsets after shuffling indices.

    Update: code changed per Anonymonk's hint. Old code:

    New code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use POSIX qw( ceil );
    use List::Util qw( shuffle );

    my $basepos = 0;

    open my $fh, '<', $ARGV[ 0 ] or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    my $offs;
    vec( $offs, 0, 32 ) = 0;

    my $total_lines = 1;
    while( my $advance = read $fh, $_, 128 * 1024 ) {
        # record the offset just past each newline, i.e. the start of the next line
        s/\n/vec( $offs, $total_lines++, 32 ) = $basepos + $+[ 0 ]; "\n"/eg;
        $basepos += $advance;
    }

    # we will be looking for "start of following line", so this is
    # a bogus entry to obviate need for special case at last line
    vec( $offs, $total_lines, 32 ) = $basepos;

    my $been_here = "\0" x ceil( $total_lines / 8 );

    my $i = 0;
    while( $i < $total_lines ) {
        my $line = int rand $total_lines;
        next if vec $been_here, $line, 1;

        my $start = vec $offs, $line, 32;
        my $end   = vec $offs, $line + 1, 32;

        seek $fh, $start, 0;
        read $fh, $_, $end - $start;
        print;

        ++$i;
        vec( $been_here, $line, 1 ) = 1;
    }

    Untested, but you get the idea.

    Note: I'm not sure this shuffle is entirely fair. I don't think it is.

    This uses $offs as a packed array of longs, and $been_here as a packed array of bools. Memory consumption with this approach will be proportional to the number of lines in your file, but not nearly as prohibitively so as with native Perl data structures (an array of numbers takes about 20 bytes per element). If you're really short of memory, you could use Sys::Mmap to use the structures from disk very efficiently; that would also deftly accelerate the initial gathering of offsets.

    Makeshifts last the longest.

      Something along these lines can work. I will just point out that a SCALAR in Perl is not a char*, since it can hold UTF-8/Unicode values. We should use vec() to ensure that the data stays exactly the same as what was appended with pack(). Also, vec() already does the pack trick for us:
      vec($offs, $i, 32) = $n;

        Yes, good points. Using vec would actually simplify the code, even.

        Makeshifts last the longest.

Re: Randomizing Big Files
by dave_the_m (Monsignor) on Jan 26, 2005 at 13:36 UTC
    If you definitely want to create a copy of the file with its lines randomly shuffled, then the approaches that spring to my mind are:

    1) make sure you have several Gbs of space free in /tmp, then

    $ perl -pe '$_ = int(rand(100000000)) . " $_"' bigfile \
        | sort -k 1n | perl -pe 's/^\d+ //' > bigfile_sorted
    Sort is generally quite efficient at sorting large files. But if you're using an OS that doesn't have a sort utility capable of handling a 4GB+ file, then

    2) randomly distribute the lines into a number of smaller files, individually randomise them, then concatenate them

    perl -ne 'BEGIN { for (1..16) { open my $fh, ">tmp$_"; push @f, $fh } }
              print { $f[rand 16] } $_' bigfile

    (randomise the files tmp1 .. tmp16), then

    $ cat tmp* > bigfile_sorted
    $ rm tmp*
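
    A minimal sketch of the "(randomise the files tmp1 .. tmp16)" step, assuming each chunk file now fits comfortably in memory (shuffle() comes from List::Util, which is in the Perl core):

    #!/usr/bin/perl
    # Shuffle each small chunk file in place.
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    for my $file (map "tmp$_", 1 .. 16) {
        open my $in, '<', $file or die "read $file: $!";
        my @lines = <$in>;
        close $in;

        open my $out, '>', $file or die "write $file: $!";
        print $out shuffle(@lines);
        close $out or die "close $file: $!";
    }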

    Dave.

Re: Randomizing Big Files
by Anonymous Monk on Jan 26, 2005 at 13:26 UTC
    What kind of constraints do you have? Does each process need to get the same number of bytes? The same number of lines? Or if you have N processes, will approx. 1/N of the lines do? Can the lines be fed in the same order as in the file, or does that have to be random as well? Can you just chop the file into N equal pieces and shuffle those? If not, why not? Would the following algorithm do?
    1. Let the number of lines in the file be L. Let the number of processes be N. Let A_i = 0 for 0 <= i < N. Label the processes P_i, 0 <= i < N.
    2. For each line l, 0 <= l < L, assign this line to process P_i with chance (L/N - A_i)/(L - l). For the process P_j which is picked this way, increment A_j.
    3. If necessary, shuffle the lines assigned to each process.

    Some pseudo, untested, code:

    my $L = ... number of lines ...;
    my $N = ... number of processes ...;

    # Open filehandles for each process:
    my @fh;
    for (my $i = 0; $i < $N; $i ++) {
        open $fh[$i], "| $process" or die;
    }

    # Initialize @A:
    my @A = (0) x $N;

    my $l = 0;

    # Iterate over the input.
    while (<$input>) {
        # Pick a number.
        my $r = rand($L - $l);

        # Find process.
        my $i = 0;
        while ($r >= ($L / $N - $A[$i])) {
            $r -= $L / $N - $A[$i];
            $i ++;
        }

        # Write line.
        print {$fh[$i]} $_;

        # Increment array.
        $A[$i] ++;

        # Increment line counter.
        $l ++;
    }
      ... my $r = rand($L - $l); ...
      This just makes it possible to pick the same line two times, and you are not sure that you will get all the lines!
Re: Randomizing Big Files
by erix (Prior) on Jan 26, 2005 at 13:15 UTC

    Gather the offsets of line-beginnings (with tell() or even 'grep -b') into an array, then randomize?

    Getting to a line when you know its offset takes very little time.
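
    A minimal sketch of gathering those offsets with tell(); the replies below discuss the memory cost of keeping 150 million of them in a normal Perl array:

    #!/usr/bin/perl
    # Record the byte offset at which every line starts.
    use strict;
    use warnings;

    open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
    my @offsets;
    while (1) {
        push @offsets, tell $fh;            # where the next line will start
        defined(readline $fh) or last;      # read one line, stop at EOF
    }
    pop @offsets;                           # drop the trailing EOF offset
    print scalar(@offsets), " line offsets collected\n";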

      For a 4GB file I need an array that uses 32 bits to save each position. 4 bytes * n_lines will use too much memory! I have about 150,000,000 lines, so that's at least 572MB just to load the positions; and then how do I randomize 150,000,000 entries, since that is something similar to sorting all of them?!

      But what we forget is that an array in Perl is an array of SCALARs, so we will use much more memory for each position than just 4 bytes. And randomizing this array in Perl will copy the array in memory, using even more memory, and will be very slow!

        First of all, you're right: an index is probably a bad way to solve your problem. It's still possible (for example, you could generate the index on disk rather than in memory if you needed to), but other monks have already posted other ideas that you might want to try first.

        But what we forget is that an array in Perl is an array of SCALARs, so we will use much more memory for each position than just 4 bytes. And randomizing this array in Perl will copy the array in memory, using even more memory, and will be very slow!

        This is off topic, but there's a more efficient way to handle very large arrays in Perl. You don't have to use normal arrays: you can construct a single string of packed bytes (using the pack() command), and access the individual elements using the vec() and unpack() commands.

        Also, randomizing the array doesn't need to use a copy of the array. Instead, you can randomize your array "in place", by swapping each array element with a random array element as you loop over the array. I believe this technique is called the "Fisher-Yates shuffle".
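
        A minimal sketch of both ideas together: the offsets live in one packed string (4 bytes each via vec()) and get shuffled in place with Fisher-Yates. The toy data here is made up:

        #!/usr/bin/perl
        # In-place Fisher-Yates shuffle over a vec()-packed vector of offsets.
        use strict;
        use warnings;

        # Toy data: pretend these are the byte offsets of 10 lines.
        my @demo = map { $_ * 100 } 0 .. 9;

        my $offsets = '';
        vec($offsets, $_, 32) = $demo[$_] for 0 .. $#demo;   # 4 bytes per entry
        my $n = @demo;

        # Swap element $i with a random element in 0 .. $i.
        for (my $i = $n - 1; $i > 0; $i--) {
            my $j = int rand($i + 1);
            next if $j == $i;
            my $tmp = vec($offsets, $i, 32);
            vec($offsets, $i, 32) = vec($offsets, $j, 32);
            vec($offsets, $j, 32) = $tmp;
        }

        print join(' ', map vec($offsets, $_, 32), 0 .. $n - 1), "\n";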

        -- AC

Re: Randomizing Big Files
by sfink (Deacon) on Jan 27, 2005 at 07:34 UTC
    Your speed is going to be limited by your disk time, which means that the solutions that randomize all the byte offsets of the lines and then read them one by one will be murderously slow, since they'll have worst-case seek performance. Trust me, this would be really, really, really bad.

    I would probably tackle it with something along these lines:

    Assume you have a way of knowing the sequence numbers of each of the lines, in order of the lines. In other words, you have something that can incrementally generate a permutation of 0..N-1, where N is the number of lines.

    Now figure out how much usable memory your machine has, and compute the average line length. Use these values to compute n=memory/ave-line-length. This is the number of lines you can fit in memory.

    Now scan through your file. Store in memory every line whose rank is less than n. When you've gone through the entire file once, you will have the first n lines in memory. Write them out in order. (To make this fast, you'd want to collect them in memory in order and then just write everything out more or less sequentially. But you'll have to be careful about allocating the right amount of space for subbatches of lines and things. You might want to use byte offsets of your lines in memory and sort them by rank in memory, just so the space usage is perfectly predictable.)

    Repeat for the lines ranked n..2n-1. Repeat, repeat, repeat, until you've gone through all N.

    It'll take N/n passes, but you'll be doing all sequential I/O.

    In practice, you don't want to use n=memory/line-length. You'll need to modify that by how much each line really uses as a scalar value, and cut it down by some factor so that you don't overflow memory if you're unlucky. (Unless your line lengths have a really strange distribution, the sum of the lengths of all the lines in each set of n isn't going to vary all that much.)

    Now about that permutation. You don't really need to honor the same permutation for the selection of n lines and for the ordering within those n lines. You can just seed some pseudo-random number generator with the same value for each pass, and then use a smaller permutation of 0..n-1 using some other technique for the in-memory reshuffle. I confess that I'm coming up with all of this off the top of my head, but I wonder if it would be random enough for you if, within the n lines, you just started at index int(rand(n)), then kept adding some prime number to it and taking that modulo n. Because your prime step is relatively prime to n (pick a prime that doesn't divide n), you will walk through all n elements before repeating an index. So you can do this as you read in the n elements, and use it to figure out an index to store the offset of the line directly in memory.

    So in perl, you would have one enormous scalar that held all of your lines, and one packed scalar of offsets that held your ordering. Every time you selected a line for the big file, you'd add on your prime number, mod it by n, and store the current offset into that "index" of your ordering variable. Then append your line to the set of lines. After you had all n, loop through the packed ordering vector and write out each line. (Keep the newlines, or you'll have to track the lengths separately!)
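
    A minimal sketch of that prime-stepped walk, with made-up values for n and the step, just to show it visits every slot exactly once before repeating:

    #!/usr/bin/perl
    # Start at a random index, keep adding a prime, take it mod n.
    # The step must be relatively prime to n to cover every index once.
    use strict;
    use warnings;

    my $n    = 10;                 # lines held in memory this pass (made up)
    my $step = 7;                  # a prime that does not divide $n
    my $idx  = int rand $n;        # random starting slot

    my %seen;
    for (1 .. $n) {
        print "$idx ";             # the slot the next incoming line would get
        $seen{$idx}++;
        $idx = ($idx + $step) % $n;
    }
    print "\n", scalar(keys %seen) == $n ? "visited all $n slots\n"
                                         : "missed some slots!\n";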

    I'm sure there's some way to preserve the simplicity but get better randomization, but my inspiration has dried up for tonight.

    Good luck!

      You made me curious enough to do the math. Indeed, two seeks per line (one to read, one to write) at 10 ms each add up to 56 hours(!) if you assume a set of 10,000,000 lines to randomize. In a 4GB file with 10,000,000 lines a line would be 430 bytes long on average, so it's probably safe to assume that the OP's actual problem would require 4-8× as much time. Waiting a week for the randomization of a 4GB file doesn't sound very appealing…

      Makeshifts last the longest.

Re: Randomizing Big Files
by neniro (Priest) on Jan 26, 2005 at 13:15 UTC
    I would try it using Tie::File and the shuffle() function from List::Util. Doing it without any CPAN modules sounds really hard to me.
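
    For reference, that suggestion would look roughly like this (as the reply below explains, it does not scale to a file this size: the assignment pulls every line into memory, and Tie::File keeps per-line offsets as well):

    #!/usr/bin/perl
    # Tie the file to an array and shuffle it -- fine for small files only.
    use strict;
    use warnings;
    use Tie::File;
    use List::Util qw(shuffle);

    tie my @lines, 'Tie::File', $ARGV[0] or die "tie $ARGV[0]: $!";
    @lines = shuffle @lines;       # reads every line, rewrites the file shuffled
    untie @lines;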

      See 390913 for why this won't work well for a file this big.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        That's extreme. I hadn't expected it to take that long.

        Thanks in advance,
        neniro

Re: Randomizing Big Files
by CountZero (Bishop) on Jan 26, 2005 at 15:17 UTC
    Why do you say that a database will be too slow? Did you try it?

    If you load the database with the lines of your file and assign a (sufficiently large) random number to an indexed extra field, you can sort on this field and you will have effectively shuffled your large file.

    Databases are specifically optimized to do such things, so why re-invent the wheel?

    If you have to reshuffle the database, all you have to do is update the field with the random number in it with another random number: one simple SQL-statement will do the trick.
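
    A hypothetical sketch of that approach with DBI (the DSN, table and column names are made up; any database with an index on the random column would do):

    #!/usr/bin/perl
    # Load each line with a random sort key, then read back ordered by it.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=shuffle.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE lines (rnd INTEGER, txt TEXT)');
    $dbh->do('CREATE INDEX rnd_idx ON lines (rnd)');

    $dbh->begin_work;                             # batch the inserts
    my $ins = $dbh->prepare('INSERT INTO lines (rnd, txt) VALUES (?, ?)');
    while (my $line = <STDIN>) {
        $ins->execute(int rand(2**31), $line);
    }
    $dbh->commit;

    # Reshuffling later is a single statement, e.g.:
    #   UPDATE lines SET rnd = abs(random())

    my $sth = $dbh->prepare('SELECT txt FROM lines ORDER BY rnd');
    $sth->execute;
    while (my ($txt) = $sth->fetchrow_array) {
        print $txt;
    }
    $dbh->disconnect;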

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Do you know how much time it takes to insert 4GB of data into a DB, when this 4GB has 150,000,000 entries, and then export all this data? Too much! Now add the insertion of a random unique ID: for each random ID to insert we need to check whether it's already in use, which needs an index. The concept for creating a new SORTED unique ID in a DB is just MAX_ID+1, which is simple; a RANDOM unique ID is while( !indexed( rand(N) ) ).
        That would indeed take very long, but reading random lines (or heaven forbid single words!) from a 4 GB file would most probably take even longer.

        And you don't have to give each record a unique random number; some collisions are acceptable and would not harm the randomness. Say you have 150 million items: a random number of at most 10 million would lead to about 15 items with the same number on average, but those 15 "same"-numbered items would come from randomly different places in your database, so that would not hurt, and there is no need to check whether a number is already in use.

        Keeping a list of all positions of your 150 million items somewhere in an array (which at 24 bytes per item plus the number of bytes to store each value, would flood all but the largest computers) would slow your computer down to a crawl.

        The concept of "slow" is relative: even something "slow" can be fast if all other options are even slower!

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Randomizing Big Files
by thospel (Hermit) on Jan 27, 2005 at 20:42 UTC
    Here is some code I wrote when someone asked this question on IRC some time ago (randomizing a big file, that is, ignoring the distribute-over-processes aspect). It's a self-tuning recursive distribute/shuffle/collect. It tries hard to be friendly to memory and the filesystem cache and to do only streaming disk accesses. The drawback for your case might be that it takes a long time before the first results start coming out.
    #! /usr/bin/perl -w
    # Shuffle the lines in a file potentially much bigger than memory
    # Author: Ton Hospel
    # License: GNU or artistic
    use strict;
    use File::Path;
    use List::Util qw(shuffle);

    # Limits on the helper chunks
    my $max_size  = 10e6;
    my $max_files = 128;

    my $dir = "Shuffle.$$";

    sub big_shuffle {
        my ($d, $in, $out) = @_;
        my $files;
        if (-f $in) {
            $files = int(($max_size - 1 + -s _) / $max_size);
            $files = $max_files if $files > $max_files;
            if ($files == 1 || $files == 2 && $max_size * 1.5 > -s _) {
                print($out shuffle(<$in>)) || die "Unexpected write error: $!\n";
                return;
            }
        } else {
            $files = $max_files;
        }
        my $format = sprintf("%s%s%%0%dd",
                             $d, $d eq $dir ? "/" : ".", length($files));
        my (@fhs, @names);
        for (0..$files-1) {
            $names[$_] = sprintf($format, $_+1);
            open($fhs[$_], ">", $names[$_]) ||
                die "Could not create $names[$_]: $!";
        }
        local $_;
        print({$fhs[rand $files]} $_) || die "Unexpected write error: $!\n"
            while <$in>;
        close($_) || die "Unexpected close error: $!\n" for @fhs;
        close($in) || die "Unexpected input close error: $!\n";
        for (@names) {
            open(my $fh, "<", $_) || die "Could not open $_: $!";
            big_shuffle($_, $fh, $out);
            unlink($_) || die "Could not unlink $_: $!";
        }
    }

    die "Too many arguments. Usage: $0 [in_file [out_file]]\n" if @ARGV > 2;
    my ($in, $out) = @ARGV;
    if (defined($in) && $in ne "") {
        open(my $fh, "<", $in) || die "Could not open $in: $!";
        $in = $fh;
    } else {
        $in = \*STDIN;
    }
    if (defined($out) && $out ne "") {
        open(my $fh, ">", $out) || die "Could not create $out: $!";
        $out = $fh;
    } else {
        $out = \*STDOUT;
    }
    mkpath($dir);
    eval {
        big_shuffle($dir, $in, $out);
        close($out) || die "Unexpected output close error: $!\n"
    };
    my $rc = $@;
    rmtree($dir);
    die $rc if $rc;