in reply to sorting type question- space problems

First, whichever way you go about sorting 20 GB of data, you will need quite a bit of disk space for intermediate storage. Even if you use a sort utility provided by your OS, it will create a number of temporary files (quite possibly several hundred) on your disk. So the first thing your program should do is check that there is ample disk space wherever the temp files will go (or you need to check manually before launching the sort). These sort utilities usually take care of removing temporary files once they no longer need them, but they may not manage to if they crash badly enough.
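For instance, here is a minimal sketch of such a check, assuming the CPAN module Filesys::Df is installed; the temp directory path and the 25 GB safety margin are made-up values for illustration:

    use strict;
    use warnings;
    use Filesys::Df;                      # CPAN module, assumed installed

    my $tmpdir  = '/var/tmp/sortwork';    # hypothetical temp directory
    my $need_kb = 25 * 1024 * 1024;       # ~25 GB in 1K blocks (guessed margin)

    my $df = df($tmpdir)                  # sizes reported in 1K blocks by default
        or die "Cannot stat filesystem for $tmpdir\n";
    die "Only $df->{bavail} KB free under $tmpdir, need $need_kb KB\n"
        if $df->{bavail} < $need_kb;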

Second, what you describe is not really what I would call sorting, but rather dispatching your data into 43 x 21 buckets (assuming from your example that there are 21 secondary keys) and then merging the buckets back in the specific order of the keys. Since each record goes straight into its bucket, that is a single O(n) pass over the data instead of an O(n log n) comparison sort, so it can be much faster than actual sorting.

I would suggest that you create 43 x 21 = 903 files in a temporary directory on disk. You then read your input file just once and dispatch each record into the proper bucket file. This requires keeping 900+ filehandles open at once. Perl itself can handle 1000 open filehandles without a problem (you'll have to keep them in an array or hash), so it should work; if you hit an operating-system limit on open files, you'll have to fall back to two passes. Then it is just a matter of merging the files back in the adequate key order, deleting each file as soon as you no longer need it; a sketch of both passes follows below. If the program crashes, all your files are in the same temp directory, so it is no big deal to get rid of them.
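Here is a minimal sketch of both passes, assuming the two keys are the first two whitespace-separated fields of each record; the input/output file names, the temp directory, and the stand-in key lists are all made up for illustration:

    use strict;
    use warnings;

    my $tmpdir = '/var/tmp/sortwork';            # hypothetical temp directory
    -d $tmpdir or mkdir $tmpdir or die "mkdir $tmpdir: $!";

    # Stand-ins for the real key values (43 primary x 21 secondary).
    my @primary   = (1 .. 43);
    my @secondary = (1 .. 21);

    # Pass 1: dispatch each record into its bucket file.
    my %fh;                                      # "k1.k2" => filehandle
    open my $in, '<', 'big_input.dat' or die "big_input.dat: $!";
    while (my $line = <$in>) {
        my ($k1, $k2) = split ' ', $line;        # assumed key positions
        my $bucket = "$k1.$k2";
        unless ($fh{$bucket}) {                  # open each bucket on first use
            open $fh{$bucket}, '>', "$tmpdir/$bucket"
                or die "open $tmpdir/$bucket: $!";
        }
        print { $fh{$bucket} } $line;
    }
    close $in;
    close $_ for values %fh;

    # Pass 2: concatenate the buckets in key order, deleting as we go.
    open my $out, '>', 'sorted_output.dat' or die "sorted_output.dat: $!";
    for my $k1 (@primary) {
        for my $k2 (@secondary) {
            my $file = "$tmpdir/$k1.$k2";
            next unless -e $file;                # bucket may be empty
            open my $b, '<', $file or die "open $file: $!";
            print $out $_ while <$b>;
            close $b;
            unlink $file or warn "unlink $file: $!";
        }
    }
    close $out;

Note that every file, input and output alike, is read and written strictly sequentially, which is about the kindest access pattern you can offer a disk.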

I do not think any other method can be faster than that.
