in reply to Re^2: When the input file is huge !!!
in thread When the input file is huge !!!
The merge is likely somewhat slower than you describe. The problem is how many parallel read/write streams the drive can handle without resorting to constant seeking. My memory says that the limit is 4, which means that you can at most read 3 chunks and write 1. If you try to do more, your disk starts to thrash. If that is right, then it can merge 16 chunks in 3 passes through the dataset (16 -> 6 -> 2 -> 1), where each pass has to both read and write all of the data. That is in addition to the original pass that sorts the input into chunks. That means the dataset has to be streamed 8 times (4 times from disk, 4 times to it). If streaming takes 1 minute per pass (which is what the OP said about his dataset), that would be 8 minutes. That is substantially faster than what I said, but I like to be conservative when I tell people that things will be slow, because I don't want them to give up prematurely if it turns out slower than they expect.
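To make that arithmetic concrete, here is a small Perl sketch (not from the thread; the chunk count, fan-in, and minutes-per-pass are just the numbers assumed above) that counts the merge passes and total streamings:

```perl
#!/usr/bin/perl
# Sketch only: counts passes for a k-way merge under the assumptions above.
use strict;
use warnings;
use POSIX qw(ceil);

my $chunks       = 16;   # sorted chunks produced by the initial pass
my $fan_in       = 3;    # chunks read per merge pass (plus 1 write stream)
my $min_per_pass = 1;    # minutes to stream the whole dataset once

# Each merge pass collapses groups of $fan_in chunks into one.
my @sizes = ( my $remaining = $chunks );
while ( $remaining > 1 ) {
    push @sizes, $remaining = ceil( $remaining / $fan_in );
}
my $merge_passes = @sizes - 1;

# The initial chunk-sorting pass also reads and writes everything once,
# and every pass streams the data twice (once in, once out).
my $streamings = 2 * ( $merge_passes + 1 );

printf "merge passes: %d (%s)\n", $merge_passes, join( ' -> ', @sizes );
printf "dataset streamed %d times, roughly %d minutes\n",
    $streamings, $streamings * $min_per_pass;
```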
Also note that a pure Perl implementation will probably have trouble sorting 500 MB chunks of data in RAM because of the memory overhead of Perl's internal data structures. However, if you are reading from and writing to the filesystem, you can wind up creating, writing, reading and deleting a file with all of the operations happening in RAM (in the operating system's filesystem cache) without anything getting flushed to disk. Used intelligently, this fact can greatly improve performance because most passes never touch the disk. It can make a fairly naive merge sort perform reasonably close to a much more optimized one.
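For anyone who wants to see the shape of that naive approach, here is a minimal pure-Perl sketch (mine, not the OP's code; it assumes one record per line on STDIN, lexical sort order, and a chunk size counted in lines rather than bytes):

```perl
#!/usr/bin/perl
# Sketch of a naive external merge sort: sort chunks, then merge them.
use strict;
use warnings;
use File::Temp qw(tempfile);

my $chunk_lines = 1_000_000;    # tune so one chunk fits comfortably in RAM

# Pass 1: slice the input into sorted chunks, one temp file per chunk.
# Small enough temp files tend to stay in the OS filesystem cache.
my @chunk_files;
my @buffer;
while ( my $line = <STDIN> ) {
    push @buffer, $line;
    if ( @buffer >= $chunk_lines ) {
        push @chunk_files, write_sorted_chunk( \@buffer );
        @buffer = ();
    }
}
push @chunk_files, write_sorted_chunk( \@buffer ) if @buffer;

# Final pass: merge all chunks at once. A real implementation would merge
# only a few chunks per pass, as discussed above, to avoid disk thrashing.
my @fhs   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @chunk_files;
my @heads = map { scalar readline $_ } @fhs;
while ( grep { defined } @heads ) {
    # Find the smallest current head line among the open chunks.
    my $min;
    for my $i ( 0 .. $#heads ) {
        next unless defined $heads[$i];
        $min = $i if !defined $min or $heads[$i] lt $heads[$min];
    }
    print $heads[$min];
    $heads[$min] = readline $fhs[$min];    # refill from that chunk
}

sub write_sorted_chunk {
    my ($lines) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} sort @$lines;
    close $fh;
    return $name;
}
```

The chunk size and the decision to merge everything in one final pass are illustrative choices; picking the fan-in per pass to match what the disk can stream is exactly the trade-off discussed above.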