The post you responded to was meant to warn against using Tie::File for this kind of random read/write access to very large files. For what it is good at, Tie::File is a brilliant module and I use it all the time, but this isn't what it is good at.

In terms of your timings, they are pretty good. Here are some (rather crude) timings, by number of records, using a slightly corrected version of the code I posted elsewhere in this thread:

  1. 100 < 1 second
  2. 1,000 < 1 second
  3. 10,000 < 2 seconds
  4. 100,000 < 8 seconds
  5. 1,000,000 < 64 seconds
  6. 10,000,000 < 12 minutes 24 seconds

I haven't got an accurate timing, but it did 1 billion records (32 GB) from/to compressed files in around 4 1/2 hours. Maximum memory used by any run is under 4 MB.

That said, these were just a single pass, and as I pointed out elsewhere, you would probably need at least two runs to achieve a reasonable randomisation. Yours is much better in that respect, I think, especially if the sort algorithm used is an unstable one. Strange to find an application that benefits from that.

The OP also had the requirement to remove duplicate records from the input, which my approach won't achieve. That said, it would require two passes through the sort utility as well wouldn't it? One to remove the duplicates before prepending the random numbers and then re-sorting, so that would balance out.
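To make the two steps concrete, here is a minimal in-memory sketch of "remove duplicates, then prepend a random number, sort and trim". It is only an illustration of the idea, not the sort-utility pipeline itself: the helper name `dedupe_and_shuffle` is hypothetical, and it assumes the records fit in memory, which the large files under discussion would not.

```perl
use strict;
use warnings;

# Hypothetical helper: drop duplicate records, then shuffle the survivors
# by decorating each with a random key, sorting on the key, and stripping it.
sub dedupe_and_shuffle {
    my @records = @_;

    # Step 1: remove duplicates (keeps the first occurrence of each record).
    my %seen;
    my @unique = grep { !$seen{ $_ }++ } @records;

    # Step 2: prepend a random key, sort on it, then trim it off again.
    return map  { $_->[ 1 ] }
           sort { $a->[ 0 ] <=> $b->[ 0 ] }
           map  { [ rand(), $_ ] } @unique;
}

print "$_\n" for dedupe_and_shuffle( qw( a b a c b d ) );
```

With an external sort utility the same two steps would each be a pass over the data, which is the cost being compared above.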

I did wonder whether instead of prepending a random number, sorting and then trimming, you could reverse the numbers, sort and re-reverse. That would randomise them pretty well for a one shot deal, but it isn't re-usable.
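A small sketch of the reverse, sort, re-reverse idea, assuming fixed-width numeric records like the zero-padded test data used in this post. The helper name `reverse_sort_reverse` is made up for the illustration. Note that the result is deterministic, which is why it only works as a one-shot deal.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the characters of each record, sort the
# reversed strings, then reverse each one back to its original form.
sub reverse_sort_reverse {
    return map  { scalar reverse $_ }
           sort { $a cmp $b }
           map  { scalar reverse $_ } @_;
}

my @records = map { sprintf '%02d', $_ } 1 .. 12;
print join( ' ', reverse_sort_reverse( @records ) ), "\n";
```

The effect is that the least significant digit becomes the primary sort key, so neighbouring input records end up spread apart in the output.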

Sorting on a randomly chosen character position in the records might also be an option. Saves prepending and cutting.
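As a rough sketch of that option, assuming fixed-length records held in memory: pick a character offset at random and sort on the text from that offset onwards. The helper name `sort_on_position` and its optional first argument are hypothetical, added so the offset can be pinned for testing.

```perl
use strict;
use warnings;

# Hypothetical helper: order fixed-length records by the text that starts at
# one character position. The position is picked at random unless supplied.
sub sort_on_position {
    my( $pos, @records ) = @_;
    return unless @records;
    $pos = int rand( length $records[ 0 ] ) unless defined $pos;
    return sort { substr( $a, $pos ) cmp substr( $b, $pos ) } @records;
}

my @records = map { sprintf '%06d', $_ } 1 .. 20;
print "$_\n" for sort_on_position( undef, @records );
```

How well this mixes the records depends entirely on the data: a position where most records share the same characters (the leading zeros here, for instance) leaves the order largely unchanged.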

As with all things, there are always several ways; which is better often varies with the volumes involved, the tools available, etc.

The code and timings.

    #! perl -slw
    use strict;
    use List::Util qw[ shuffle ];

    ## omit the glob if your shell expands wildcards for you.
    BEGIN{ @ARGV = shuffle map glob, @ARGV }

    print ~~localtime;

    ## Pass 1: scatter the input lines across 100 temp files at random.
    my @temps;
    open $temps[ $_ ], '> :raw', "tmp/$_.tmp" or die "tmp/$_.tmp : $!"
        for 0 .. 99;

    while( <> ) {
        ## '%s' so that a '%' in the data is not treated as a format directive.
        printf { $temps[ rand 100 ] } '%s', $_;
    }
    print 'Temp files written';

    ## Reopen the temp files for reading.
    do{
        close $temps[ $_ ];
        open $temps[ $_ ], '< :raw', "tmp/$_.tmp" or die "tmp/$_.tmp : $!" ;
    } for 0 .. 99;

    open FINAL, '> :raw', "randomised.all" or die $!;

    ## Pass 2: take the next line from a randomly picked temp file
    ## until every temp file is exhausted.
    while( @temps ) {
        my $pick = int rand @temps;
        my $line = readline( $temps[ $pick ] );
        printf FINAL '%s', $line if defined $line;   ## a temp file may be empty
        if( eof $temps[ $pick ] ) {
            close $temps[ $pick ];
            undef $temps[ $pick ];
            splice @temps, $pick, 1;
        }
    }
    close FINAL;
    print ~~localtime;
    __END__
    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 100 > junk

    P:\test>389660 junk
    Wed Sep 15 14:25:59 2004
    Temp files written
    Wed Sep 15 14:25:59 2004

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 1000 > junk

    P:\test>389660 junk
    Wed Sep 15 14:26:08 2004
    Temp files written
    Wed Sep 15 14:26:08 2004

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 10000 > junk

    P:\test>389660 junk
    Wed Sep 15 14:26:17 2004
    Temp files written
    Wed Sep 15 14:26:18 2004

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 100000 > junk

    P:\test>389660 junk
    Wed Sep 15 14:26:26 2004
    Temp files written
    Wed Sep 15 14:26:33 2004

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 1000000 > junk

    P:\test>389660 junk
    Wed Sep 15 14:27:02 2004
    Temp files written
    Wed Sep 15 14:28:05 2004

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 20000000 > junk
    Terminating on signal SIGINT(2)

    P:\test>perl -le"printf qq[%030d\n], $_ for 1 .. $ARGV[ 0 ]" 10000000 > junk

    P:\test>389660 junk
    Wed Sep 15 14:30:17 2004
    Temp files written
    Wed Sep 15 14:42:40 2004

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^6: Strategy for randomizing large files via sysseek by BrowserUk
in thread Strategy for randomizing large files via sysseek by Anonymous Monk
