Okay, since chunking is not allowed and you're noticing that working on small files is faster than working on big ones, I propose the following:
- Figure out how many lines you're going to have, or just define a $maxlines that's definitely going to be bigger, but not by too many orders of magnitude.
- Go through your original data sequentially.
- Assign each line encountered an unused random number less than your $maxlines.
- Use the first part of this number as a file index (to determine the target file).
- Sequentially write the second part of the number as a line-index within the file, followed by the data line, into the target file.
- Add the random number to your list of used-up numbers.
- Individually sort each of the target files by the line-index.
- Remove the line-indexes from your target files.
- If the target files need to be different sizes, just go through them sequentially and create new ones with the lengths of your choice from them.
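The steps above can be sketched roughly as follows. This is only a minimal illustration in Python, not a tuned implementation: the function name `scatter_shuffle`, its parameters, the `bucketN.tmp` file names and the tab separator between line-index and data are all assumptions of mine. It also keeps every target file open at once and holds the used numbers in an in-memory set, which may not suit very large inputs.

```python
import os
import random


def scatter_shuffle(src_path, dst_path, max_lines, lines_per_file, workdir, seed=None):
    """Shuffle the lines of src_path into dst_path via indexed target files.

    All names here are hypothetical; max_lines plays the role of $maxlines.
    """
    rng = random.Random(seed)
    n_files = -(-max_lines // lines_per_file)  # ceiling division
    paths = [os.path.join(workdir, "bucket%d.tmp" % i) for i in range(n_files)]
    buckets = [open(p, "w", encoding="utf-8") for p in paths]
    used = set()  # the list of used-up random numbers
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:  # one sequential pass over the original data
                if len(used) >= max_lines:
                    raise ValueError("max_lines is smaller than the input")
                r = rng.randrange(max_lines)
                while r in used:  # redraw until the number is unused
                    r = rng.randrange(max_lines)
                used.add(r)
                # First part -> target file, second part -> line-index.
                file_idx, line_idx = divmod(r, lines_per_file)
                if not line.endswith("\n"):
                    line += "\n"  # a final unterminated line gets a newline
                buckets[file_idx].write("%d\t%s" % (line_idx, line))
    finally:
        for b in buckets:
            b.close()

    with open(dst_path, "w", encoding="utf-8") as dst:
        for p in paths:
            with open(p, encoding="utf-8") as f:
                # Split only on the first tab so tabs in the data survive.
                rows = [row.split("\t", 1) for row in f]
            rows.sort(key=lambda row: int(row[0]))  # sort by line-index
            dst.writelines(text for _, text in rows)  # drop the line-index
            os.remove(p)
```

Here the sorted target files are simply concatenated into one output file; splitting them into files of other lengths would be one more sequential pass. If `max_lines` is only slightly larger than the real line count, the redraw loop slows down towards the end, which is one reason to leave some headroom.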
This way you'll do all the reading and writing sequentially, except for when you're sorting the target files. You can fine-tune their size by tweaking the size of the file-index relative to that of the line-index. You might even want to keep your target files short enough that each of them can be read into memory in one pass and sorted there, thereby virtually eliminating non-sequential disk access.
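As a rough illustration of that tuning, the split between file-index and line-index can be derived from a memory budget. The helper below is a hypothetical sketch: the name `plan_buckets`, the average line length and the `overhead_factor` (a guess at how much more memory a line takes once loaded than it does on disk) are assumptions, not measured values.

```python
def plan_buckets(max_lines, avg_line_bytes, memory_budget_bytes, overhead_factor=3):
    """Return (lines_per_file, n_files) so one target file fits the budget.

    overhead_factor is an assumed multiplier for in-memory overhead.
    """
    bytes_per_line = avg_line_bytes * overhead_factor
    lines_per_file = max(1, memory_budget_bytes // bytes_per_line)
    n_files = -(-max_lines // lines_per_file)  # ceiling division
    return lines_per_file, n_files
```

A larger memory budget means fewer, longer target files; a smaller one means more files, each with a shorter line-index range.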
Have fun!
Update: Actually, I now realize this isn't unlike bluto's solution, except that it's much more explicit about not rewriting each file on its own.
Update: TilRMan's solution made me add the sentence about sorting in memory.