in reply to Strategy for randomizing large files via sysseek

The following low tech solution may or may not work for you -- it trades millions of seeks for several sequential rewrites of the file (similar to BrowserUk's solution) and doesn't require much code at all.

Rewrite the files with a random number prepended (or appended) to each line. Run the new files through GNU sort, being sure to use the -T flag to send temp files to a place where you have some free disk, preferably on a different physical disk (enough for about 1.5 to 2x the size of the largest file). You can then send the output through a pipe where a process removes the random number from each line.
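
A minimal sketch of that decorate/sort/undecorate pipeline, with the rewrite steps done as Perl one-liners (file names and the -T directory are placeholders):

    # decorate: prepend a random sort key and a tab to every line
    perl -ple '$_ = rand() . "\t" . $_' big.dat > big.keyed

    # sort (the whole-line sort works because the leading key is random),
    # keeping temp files on the spare disk, then strip the key back off
    sort -T /spare/tmp big.keyed | perl -pe 's/^[^\t]*\t//' > big.shuffled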

NOTE: I recall being able to sort text files close to a GB with GNU sort in reasonable time, but I'm not sure how it handles files much larger than this.


Re^2: Strategy for randomizing large files via sysseek
by BrowserUk (Patriarch) on Sep 10, 2004 at 01:49 UTC

    Why trade an O(2N) solution for an O(N)+O(N log N)* one?

    *At best
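
    As a rough illustration of those growth rates, assuming the roughly 20 million records estimated further down the thread (real timings will be dominated by disk I/O):

        perl -e '
            my $n = 20_000_000;
            printf "2N            : %.1e record touches\n", 2 * $n;
            printf "N + N*log2(N) : %.1e touches plus comparisons\n",
                $n + $n * log($n) / log(2);
        '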


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      The main benefit of course is that there is less code to test/maintain, hence the "low tech" caveat. If it's fast enough for the OP, then it's worth considering. I'm sure the OP can figure this out easily by testing.

      On a lesser note, I am not convinced that your solution will fairly shuffle the lines, even after a second pass -- though if the number of temp files and/or passes is scaled with the total number of lines, this may be sufficient for the OP's needs. (I.e. tempfiles^passes probably needs to equal or exceed the total number of lines to sort, but that's a WAG.)
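
      Taking that guess at face value, here is a quick check of how many passes it implies; the 400 temp files figure is only an assumed example:

          # smallest number of passes with tempfiles^passes >= number of lines
          perl -MPOSIX -e 'printf "passes: %d\n", ceil(log(20_000_000)/log(400))'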

        The main benefit of course is that there is less code to test/maintain,...

        I was kind of pleased with how simple the code was.

        On a lesser note I am not convinced that your solution will fairly shuffle the lines, even after a second pass...

        Indeed, you are correct, it does not produce a fair shuffle. It is capable (with a minor correction to the untested code) of producing every possible combination--which I equated in my mind with "fair".

        Thanks for calling me to book and making me think about it harder.

        However, given the OP's info of at least 1GB of data, with records of maximum length around 50 chars, there are at least 20 million records. A fair shuffle would have to be able to produce any of 20,000,000! possible orderings, and even factorial( 1000 ) is already a number of more than 2,500 digits.

        And with the processing taking around 15 minutes/GB, it is unlikely that the lack of fairness is going to be a factor :).
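
        For a sense of scale, a back-of-envelope Stirling's approximation puts 20,000,000! at roughly 137 million digits:

            perl -e '
                my $n  = 20_000_000;
                my $pi = 4 * atan2(1, 1);
                # Stirling: ln(n!) ~= n*ln(n) - n + 0.5*ln(2*pi*n)
                my $digits = ($n * log($n) - $n + 0.5 * log(2 * $pi * $n)) / log(10);
                printf "20,000,000! has about %.0f million digits\n", $digits / 1e6;
            '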

        That said, the OP's requirement to remove duplicates (rather than just not produce them) means a sort is required anyway, so it all becomes moot.
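
        If GNU sort is used for the duplicate removal as well, the two steps chain naturally; a sketch only, with file names and the -T directory as placeholders:

            sort -T /spare/tmp -u big.dat |
              perl -ple '$_ = rand() . "\t" . $_' |
              sort -T /spare/tmp |
              perl -pe 's/^[^\t]*\t//' > big.shuffled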


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Well, your solution may be linear, but it doesn't remove duplicates, which is part of the requirement.
Re^2: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 09, 2004 at 18:28 UTC
    (OP) Thanks to everyone. I am going to try this solution out. Will report back tomorrow on how it went.

    FWIW, the memory finally "maxed out" at around 700M and actually went down a little bit at around hour 18.