Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Randomize lines with limited memory

by natch (Sexton)
on Nov 01, 2003 at 21:53 UTC ( [id://303841]=perlquestion: print w/replies, xml ) Need Help??

natch has asked for the wisdom of the Perl Monks concerning the following question:

I usually use a script from the Perl Cookbook (first edition) when I need to randomize the lines in a file. It works great. The script loads the entire file into memory as an array, then uses a fischer-yates shuffle to randomize the array elements before printing them out.

But this time, my file is much larger and I'm wondering if there is a better method to use. The file has around 3.7 million entries, and is 227 megabytes in size. My machine has 512MB of memory, but I was afraid this was going to bring it to its knees. And in fact, when I tried, it quickly died with an "out of memory" error. Is there a better way to randomize the lines in a file, that works for files that would challenge the memory capacity of a machine? I can do the coding; I'm just wondering about what approaches people would suggest.

Replies are listed 'Best First'.
Re: Randomize lines with limited memory
by Corion (Patriarch) on Nov 01, 2003 at 22:23 UTC

    You can simply trade time for space here:

    1. Count the number of lines in your text file
    2. Generate a randomized order of lines by choosing a permutation of the line numbers (this will take about 32 bytes per line number).
    3. Rewind your text file.
    4. Read up to the line that is to become the next line written.
    5. Copy it into the target file.
    6. Repeat the three previous steps until the shuffled file is written out.

    Of course, this method won't work well if you always choose the same permutation, but you could also do a Fisher-Yates shuffle of the line numbers and then recreate the file from the line numbers as above.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

      I'd probably use a variation on this. Read the file once saving the offset of each line. Shuffle the array of offsets. Step through the array, seeking to each offset in turn, and print the line at that offset.

      Using seek() will avoid all that rewinding and reading.

      By the way, you can do almost exactly this but without worrying about the nitty gritty. Use Tie::File. Just tie the file to @tied_file, shuffle the numbers 0 .. $#tied_file and print out each line in the array in shuffled order.

      -sauoq
      "My two cents aren't worth a dime.";
      
      Your method involves 3.7 million random file seeks.

      --
      TTTATCGGTCGTTATATAGATGTTTGCA

        I did mention that I trade space (memory) for time, didn't I?

        Also, my method dosen't involve random seeks, it only involves rewinding the file to the start point, which is a seek to a fixed position in the file. I think you could halve the number of times the file by checking whether the next line number is greater than the current line number and then winding forward to that one, as I think that there is a 50% chance that the next line number to be written is higher than the current number.

        That point is moot though, as Tie::File implements a caching scheme and is a much cleaner solution than my naive but memory conserving approach.

        perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
•Re: Randomize lines with limited memory
by merlyn (Sage) on Nov 02, 2003 at 00:28 UTC
    I question the need to randomize the file and rewrite it in the first place. Using something like DBD::SQLite, you can store this stuff into a structured flat file, then write a series of random numbers into a second column, and then select items from your entries column based on a sort of the second column (or write a shuffled index list into the second column).

    Anytime something sounds this hard, it's better to back up a step and say "what corner did I paint myself in to, and how did I get here?". Always question the context.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Randomize lines with limited memory
by jdporter (Paladin) on Nov 01, 2003 at 22:14 UTC
    Divide and conquer. Chop the big file into as many files as it takes to make them a manageable size. Then randomize each of them normally (i.e. using the fisher-yates). The tricky part is the initial chopping up. You can't simply take the first 10k lines, then the second 10k lines, etc. That wouldn't give you adequate randomization (obviously). I think I would read each line, and choose an output file at random, and append that line to it. The files won't come out exactly the same size (except perhaps rarely), but that doesn't matter. You could also similarly randomize on the final joining step as well, but I'm not sure that would actually buy you anything. You could try it.

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Randomize lines with limited memory
by Roger (Parson) on Nov 01, 2003 at 22:57 UTC
    I have written something similar to what you are trying to achieve, I rewrote an old script that had a very inefficient algorithm, with a new script with an efficient algorithm. The algorithm I used was similar to Corion's but different in detail. The algorithm is as follows:

    1. 1st pass through the file, build an array structure that records the position of the line number in the file, with the line number as the array index.
    2. randomize the line numbers, generate a new randomized list with line numbers in the new sequence.
    3. for each line number in the new sequence, look up the @linepos->[$line_number] => position in the array structure, seek to the desired position in the source file, read the line, and output the line in the new file.

    This algorithm is very fast, I once have reduced the time of processing of a file from 15+ minutes using the old and clumsy algorithm, to just under 3 seconds using the new efficient algorithm.

Re: Randomize lines with limited memory (Roll your own...)
by BrowserUk (Patriarch) on Nov 02, 2003 at 02:30 UTC

    ...or use mine:)

    This is one of those cases where rolling your own has benefits. The FAQ, Cookbook and the List::Util version of the Fischer-Yates shuffle all use the copy semantics. This means that you need over double the space required to store the data, in order to shuffle it.

    My version does an in-place shuffle, the benefits of which really show up when you start shuffling huge arrays. The following results are a comparision between my pure-perl, inplace shuffle and the List::Util XS version.

    P:\test>test -N=2000000 Pre-allocation 2696 kb (Memory use noted manually) Pre-inplace shuffle 43068 kb Post-inplace shuffle 43076 kb Post-XS_copy shuffle 91192 kb 1 trial of Inplace (41.760s total) 1 trial of Copied (48.400s total)

    The results show that my in-place version consumes just 8k extra ram to perform the shuffle, and takes about 15% less time to do it than the XS version. The XS version only takes around 15 seconds to actually perform the shuffle, but the copy semantics mean it loses this performance advantage by the need to allocate double the space, ending up considerably slower.

    It wouldn't be that hard (if you are an an accomplished XS programmer) to re-cast the List::Util version to detect that it was being given an array reference and was being called in a void context and switch to an in-place algorithm. Some crude tests seem to show that this would not only halve the memory usage, but as a result, would cut the overall shuffle time to less than a third.

    The benchmark (You'll need to use an external tool to measure the memory usage).

    #! perl -slw use strict; use List::Util qw[ shuffle ]; use Benchmark::Timer; our $N ||= 1_000_000; sub my_shuffle (\@) { my( $aref, $x ) = shift; for my $y ( 0 .. $#{ $aref } ) { $x = $y + rand( @{ $aref } - $y ); @$aref[ $y, $x ] = @$aref[ $x, $y ]; } } my $timer = new Benchmark::Timer; my @array; print 'Pre-allocation'; <STDIN>; push @array, $_ for 1 .. $N; print 'Pre-inplace shuffle'; <STDIN>; $timer->start('Inplace'); my_shuffle @array; $timer->stop('Inplace'); print 'Post-inplace shuffle'; <STDIN>; $timer->start('Copied'); my @shuffled = shuffle @array; $timer->stop('Copied'); print 'Post-XS_copy shuffle'; <STDIN>; $timer->report; __END__ P:\test>test -N=2000000 Pre-allocation Pre-inplace shuffle Post-inplace shuffle Post-XS_copy shuffle 1 trial of Inplace (41.760s total) 1 trial of Copied (48.400s total)

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!
    Wanted!

Re: Randomize lines with limited memory
by sauoq (Abbot) on Nov 01, 2003 at 23:18 UTC

    Here's an implementation of the suggestion I made in my reply to Corion...

    #!/usr/bin/perl -w use strict; use Tie::File; use List::Util qw( shuffle ); tie my @file, 'Tie::File', $ARGV[0] or die "$ARGV[0]: $!"; print $file[$_], "\n" for shuffle 0 .. $#file;
    Short and sweet. :-)

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Randomize lines with limited memory
by gjb (Vicar) on Nov 01, 2003 at 22:15 UTC

    If you don't really need a fair shuffle, you could get around the memory limitations by simply process the file by parts. Do a fair shuffle on each of the parts, and cat them together in random order.
    For better results, do this a number of times using a non-zero offset.

    Of course, you won't have a real fair shuffle, but maybe this is good enough.

    Hope this helps, -gjb-

Re: Randomize lines with limited memory
by scmason (Monk) on Nov 02, 2003 at 10:44 UTC
    This is simple. Two methods:

    1)Create an array with values 1 to 3.5 million. Randomize th.is array. Then start at the first element of the array, if it is (say) 1376, read the 1367th line from the file and write it to the new file. Then read the next number from the array, and it's corresponding line from the old file and so on.

    2)Read in the first 100,000 or so lines, randomize them and write them to a temp file. Read the second 100,000, randomize, write and so on. When all 3.5 million are in 35 new files, select a random number between 1 and 35 and then read the next line of that file and put it on the end of your final file. This way they are not in random order in 100,000 line chunks, but are more evenly distributed.

    The first method is just as 'random' as your normal method, because it is the same, but much slower. There are no memory problems though.

    It think could be shown that the second method is also rather 'random' but it is different and perhaps not as 'acceptable' until you did a formal proof.

Re: Randomize lines with limited memory
by exussum0 (Vicar) on Nov 02, 2003 at 13:18 UTC
    Tie::File should work for you. Especially since you aren't writting the file, my guess is it should be quite efficient :)
    But I haven't used it.

    Play that funky music white boy..
Re: Randomize lines with limited memory
by Anonymous Monk on Nov 03, 2003 at 21:57 UTC
    Buckets. Buckets buckets buckets.

    Open in file,
    open (n) out files,
    read line by line,
    output to out file rand(n)
    close all files
    append all out files together at end

    for better randomness, greater (n) is good, and operation can be repeated as many times as desired. Do it at least twice to avoid order bias.

    love the buckets

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://303841]
Approved by gjb
Front-paged by bart
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2024-03-28 21:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found