in reply to Re^3: Strategy for randomizing large files via sysseek
in thread Strategy for randomizing large files via sysseek

Now try it on a file with a million lines...

Update: Apparently, this is seen as sarcasm or otherwise unhelpful; so here are some statistics:

Using Tie:File and an inplace Fischer-Yates shuffle to sort various sized files:

  1. 100 lines: 58 milliseconds.
  2. 1,000 lines: 2 seconds
  3. 10,000 lines: 194 seconds
  4. 100,000 lines: after 3 1/2 hours of cpu I got sick of listening to the fan thrashing itself to death trying to keep the cpu cool, and aborted.
  5. 20, 000,000 (the OP's task): Probably best measured in half-lives of Plutonium.

The test code should anyone wish to verify my figures.

#! perl -slw use strict; use Tie::File; use Benchmark::Timer; our $N ||= 1000; sub shuffle { my $ref = @_ == 1 ? $_[ 0 ] : [ @_ ]; for( 0 .. $#$ref ) { my $p = $_ + rand( @{ $ref } - $_ ); @{ $ref }[ $_, $p ] = @{ $ref }[ $p, $_ ]; } return unless defined wantarray; return wantarray ? @{ $ref } : $ref; } open OUT, '>', 'junk.dat' or die $!; printf OUT "%030d\n", $_ for 0 .. $N; close OUT; my @lines; tie @lines, 'Tie::File', 'junk.dat'; my $T = new Benchmark::Timer; $T->start( "shuffle $N" ); shuffle \@lines; $T->stop( "shuffle $N" ); $T->report;

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^5: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 15, 2004 at 13:02 UTC
    I did a
    time perl -pe '$r = substr(rand(), 2); $_ = "$r\t$_"' input | sort -n +| cut -f 2- > dev/null
    on the same files, and my results are:
    nr of lines          real       user        sys
            100       0m 0.010s  0m 0.010s   0m 0.000s
           1000       0m 0.051s  0m 0.010s   0m 0.030s
          10000       0m 0.264s  0m 0.180s   0m 0.040s
         100000       0m 2.608s  0m 1.640s   0m 0.140s
        1000000       0m40.640s  0m17.550s   0m 1.060s
       10000000      17m14.639s  3m30.830s   0m27.200s
    

      The post to which you've responded was meant to warn against using Tie::File for this kind of random rw access to very large files. For what is it good at, Tie:File is a brilliant module, I use it all the time--but this isn't what it is good at.

      In terms of your timings, they are pretty good. Here are some (rather crude) timing using a slightly corrected version of the code I posted elsewhere in this thread:

      1. 100 < 1 second
      2. 1,000 < 1 second
      3. 10,000 < 2 seconds
      4. 100,000 < 8 seconds
      5. 1,000,000 < 64 seconds
      6. 10,000,000 < 12m24secs

      I haven't got an accurate timing, but it did 1 billion records (32 GB) from/to compressed files in around 4 1/2 hours. Maximum memory used by any run is under 4 MB.

      That said, these were just a single pass, and as I pointed out elsewhere, you would probably need at least two runs to achieve a reasonable randomisation. Yours is much better in that respect I think--especially if the sort algorithm used is an unstable one. Strange to find an application that benefits from that.

      The OP also had the requirement to remove duplicate records from the input, which my approach won't achieve. That said, it would require two passes through the sort utility as well wouldn't it? One to remove the duplicates before prepending the random numbers and then re-sorting, so that would balance out.

      I did wonder whether instead of prepending a random number, sorting and then trimming, you could reverse the numbers, sort and re-reverse. That would randomise them pretty well for a one shot deal, but it isn't re-usable.

      Sorting on a randomly chosen character position in the records might also be an option. Saves prepending and cutting.

      Like all things, there are always several way, which is better often varies with the volumes involved, the tools available etc.

      The code and timings.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        I guess the size of your file is imiting. I wrote the code below, and got much better times than the sarcastic guy, but still much too slow for you I think. I based mine on sorting the indexes only, not the entire content of the line. However this does not address your problem of also removing duplicates. I'd be interested in testing what I can up with for your data file. If you want to post it somewhere let me know, and I'll see if I can beat your 4 hour mark.
        use strict; use Tie::File; use Benchmark::Timer; our $N ||= 100000; sub shuffle { my $shuffled = (); my $numlines = scalar @{$_[0]}; while ($numlines > 0) { my $randomLineNum = int rand $numlines; push(@$shuffled, $randomLineNum); splice(@{$_[0]}, $randomLineNum, 1); $numlines--; } return $shuffled; } open OUT, '>', 'junk.dat' or die $!; printf OUT "%030d\n", $_ for 0 .. $N; close OUT; my @lines; tie @lines, 'Tie::File', 'junk.dat'; my @indexList = (0..scalar @lines); my $T = new Benchmark::Timer; $T->start( "shuffling $N lines" ); my $newOrder = shuffle scalar \@indexList; $T->stop( "shuffling $N lines" ); $T->start( "Writing New ordered File" ); open NEW, ">sortedJunk.dat"; foreach my $lineNum(@$newOrder){ print NEW $lines[$lineNum],"\n"; } $T->stop( "Writing New ordered File" ); $T->report();

        and my results are pretty fast:
        numLines,time,writingTime
        100,281us total,15.222ms total
        1000,1.480ms total,57.048ms total
        10000,29.017ms total,559.466ms total
        100000,1.922s total,5.969s total
        1000000,256.212s total,154.443s total