
Re: Strategy for randomizing large files via sysseek

by mhi (Friar)
on Sep 09, 2004 at 18:41 UTC ([id://389832])


in reply to Strategy for randomizing large files via sysseek

Okay, since chunking is not allowed and you're noticing that working on small files is faster than big ones, I propose the following:

  1. Figure out how many lines you're going to have, or just define a $maxlines that's definitely going to be bigger, but not by too many orders of magnitude.
  2. Go through your original data sequentially.
    1. Assign each line encountered an unused random number less than your $maxlines.
    2. Use the first part of this number (say, the leading digits) as a file-index to determine the target file.
    3. Write the second part of the number (the remaining digits) as a line-index, followed by the data line, to the end of the target file.
    4. Add the random number to your used up number list.
  3. Individually sort each of the target files by the line-index.
  4. Remove the line-indexes from your target files.
  5. If the target files need to be different sizes, just go through them sequentially and create new ones with the lengths of your choice from them.
This way you'll do all the reading and writing sequentially, except for when you're sorting the target files. You can fine-tune their size by tweaking the size of the file-index relative to that of the line-index. You might even want to keep your target files short enough that each of them can be read into memory in one pass and sorted there, thereby virtually eliminating non-sequential disk access.
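
The steps above can be sketched roughly as follows. This is only an illustration, written in Python rather than Perl, and it makes a few assumptions that are not part of the proposal itself: the name `shuffle_lines` and the `bucket_size` and `seed` parameters are made up, the number is split with `divmod` instead of by digits, the used-number list is kept in an in-memory set, every target file stays open during the first pass, and each target file is small enough to sort in memory.

```python
import os
import random
import tempfile


def shuffle_lines(src_path, dst_path, max_lines, bucket_size=1000, seed=None):
    """Write the lines of src_path to dst_path in random order.

    max_lines must be at least the number of input lines. bucket_size
    (a hypothetical knob) is how many random numbers map to one target file.
    """
    rng = random.Random(seed)
    used = set()      # the "used up number list" (assumed to fit in memory)
    handles = {}      # file-index -> open target file
    with tempfile.TemporaryDirectory() as tmp:
        def bucket_path(file_idx):
            return os.path.join(tmp, "bucket%d" % file_idx)

        try:
            # Step 2: one sequential pass over the original data.
            with open(src_path, "rb") as src:
                for line in src:
                    if len(used) >= max_lines:
                        raise ValueError("max_lines is smaller than the input")
                    # 2.1: draw until we hit an unused number. This gets slow
                    # when max_lines is only barely larger than the line count.
                    while True:
                        n = rng.randrange(max_lines)
                        if n not in used:
                            used.add(n)  # 2.4
                            break
                    # 2.2: high part picks the file, low part is the line-index.
                    file_idx, line_idx = divmod(n, bucket_size)
                    out = handles.get(file_idx)
                    if out is None:
                        out = handles[file_idx] = open(bucket_path(file_idx), "wb")
                    # A final line without a newline gets one added, so that
                    # records in the target files stay separated.
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    # 2.3: append "line-index <TAB> data line" to the target file.
                    out.write(b"%d\t" % line_idx + line)
        finally:
            for out in handles.values():
                out.close()

        # Steps 3 to 5: sort each target file by line-index in memory, drop
        # the index, and append the result to the output sequentially.
        with open(dst_path, "wb") as dst:
            for file_idx in sorted(handles):
                with open(bucket_path(file_idx), "rb") as bucket:
                    rows = [row.split(b"\t", 1) for row in bucket]
                rows.sort(key=lambda row: int(row[0]))
                dst.writelines(text for _, text in rows)
```

Something like `shuffle_lines("big.txt", "shuffled.txt", max_lines=10_000_000, bucket_size=100_000)` would then use at most 100 target files of roughly equal size. A smaller `bucket_size` gives more but shorter files, which is the file-index versus line-index trade-off mentioned above.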

Have fun!

Update: Actually, I now realize this isn't unlike bluto's solution, except that it's much more explicit about not rewriting each file for itself.
Update: TilRMan's solution made me add the sentence about sorting in memory.
