in reply to sorting very large text files
If your file has 40e6 records and it takes an hour to sort, the utility is succeeding in reading-sorting-writing 11 records every millisecond, which is pretty good by anyone's standards given the I/O involved.
However, the example records you've posted are 96 chars, so 40e6 records would amount to just under 4GB. If your 15GB files contain similar records, then they'll have more like 165 million records, and if those take an hour, the utility is reading-sorting-writing 45 records per millisecond.
You simply aren't going to beat the highly optimised C code using Perl.
If you're doing this often enough to need to speed it up, then there are a few possibilities you could consider.
If you have more than one (local) drive on the machine where this is happening, try to ensure that the output file is on a different drive (physical, not partition) from the input file.
Also check the effect of using -T, --temporary-directory=DIR to place the temporary files on a different physical drive.
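As a minimal sketch of the idea, assuming GNU sort (the file and directory names here are placeholders; on real data you'd point the input, the output, and -T at three different physical drives):

```shell
# Toy demonstration of -T and -o; in practice, put input.txt, sorted.txt
# and tmpdir on different physical drives (names here are placeholders).
mkdir -p tmpdir
printf '30_c\n11_a\n99_b\n' > input.txt
sort -T tmpdir -o sorted.txt input.txt
cat sorted.txt    # 11_a, 30_c, 99_b, one per line
```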
If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk.
These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.
If you have multi-core hardware, you might achieve some gains by preprocessing the file to split it into a few smaller sets, sorting those in concurrent processes, and concatenating the outputs.
Say you have 4 CPUs available and your records are fairly evenly distributed, ranging from 11_... to 99_.... You could then start (say) three child processes from your Perl script, using piped opens, and feed them records as you read them: 11_... through 39_... to the first; 40_... through 69_... to the second; and 70_... onwards to the last. You then concatenate the output files from the three sorts to achieve the final goal.
Again, you'd need to experiment with the placement of the output files and number of processes to see what works best.
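To make the scheme concrete, here's a shell sketch of the same split/sort/concatenate idea (the suggestion above is to drive this from Perl with piped opens; the filenames and the three-way range boundaries here are just illustrative):

```shell
# Generate some sample records (stand-ins for the real 96-char records).
printf '45_rec\n12_rec\n88_rec\n30_rec\n70_rec\n' > input.txt

# Route each record to a bucket by its leading numeric key range.
awk '{ k = substr($0, 1, 2) + 0 }
     k < 40 { print > "part1"; next }
     k < 70 { print > "part2"; next }
            { print > "part3" }' input.txt

# Sort the buckets concurrently...
sort -o part1.sorted part1 &
sort -o part2.sorted part2 &
sort -o part3.sorted part3 &
wait

# ...then concatenate: the buckets cover disjoint key ranges,
# so the concatenation is fully sorted.
cat part1.sorted part2.sorted part3.sorted > sorted.txt
```

For real use you'd place the bucket files and outputs on different drives, as discussed above, and tune the number of buckets to the number of CPUs.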
Anyway, just a few thoughts for consideration.
Replies are listed 'Best First'.

- Re^2: sorting very large text files by salva (Canon) on Dec 19, 2009 at 09:37 UTC
  - by BrowserUk (Patriarch) on Dec 19, 2009 at 11:00 UTC
    - by salva (Canon) on Dec 21, 2009 at 09:20 UTC
- Re^2: sorting very large text files by Anonymous Monk on Dec 20, 2009 at 04:29 UTC
  - by BrowserUk (Patriarch) on Dec 20, 2009 at 08:23 UTC
    - by Anonymous Monk on Dec 23, 2009 at 10:59 UTC