in reply to Re^9: Sorting a (very) large file (memory)
in thread Sorting a (very) large file
Well, I'll try one more time, even though you don't seem to be paying much attention, IMHO.
Iterating just once over something that doesn't fit in physical memory takes only about as much page faulting as it took to allocate and fill the big thing in the first place (and filling it takes about the same number of page faults whether it fits in memory or not; well, up to twice as many). So iterating over something once can take about twice as long if it doesn't fit in real memory (not a big deal).
Yes, a non-trivial "algorithm" run on something that doesn't fit into real memory will often cause a lot more page faults. sort is O(N*log(N)), so it will visit most of the items being sorted about O(log(N)) times, and so a list of items that doesn't fit into memory may make the sort run about log(N)-times slower than if the items did fit (assuming that page faulting over the list once takes quite a bit more time than the operations sort performs on the list). That is often a huge difference. You needed 8,500,000 items to reproduce the sizes involved, so the slow-down can be proportional to log(8500000) (the "proportional" part means we can't use this theory to predict the multiplier, but it could certainly be quite large).
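For a rough sense of that multiplier (just the arithmetic, not a benchmark):

```perl
use strict;
use warnings;

# Comparison sorts visit items about log2(N) times; for the
# 8.5 million items mentioned above that depth is roughly:
my $n     = 8_500_000;
my $depth = log($n) / log(2);    # Perl's log() is the natural log

printf "log2(%d) =~ %.1f\n", $n, $depth;
```

So each item may be touched a couple dozen times during the sort, and any of those touches can fault if the list doesn't fit.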
My approach iterates over the list of lines just a couple of times. So the impact of the lines not fitting into real memory should only make my code run 2-to-4-times slower. Certainly that might be worth spending more effort to reduce, but it can also be a quite minor concern. And it can certainly be a huge win over the cost of sorting the too-big list directly, especially when you add to that too-big list all of the overhead of a ton of anonymous arrays.
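A minimal sketch of that two-pass idea (this is illustrative code, not what was actually posted; the sub name, the assumption of a numeric leading key, and the fixed 32-bit packed record are all just for the example):

```perl
use strict;
use warnings;

sub sort_file_by_number {
    my ($in, $out) = @_;
    open my $ifh, '<', $in or die "open $in: $!";

    # Pass 1: keep one packed 8-byte record per line (32-bit key,
    # 32-bit byte offset) instead of the whole line, so the array
    # being sorted stays small enough to stay in real memory.
    my @recs;
    my $off = 0;
    while ( my $line = <$ifh> ) {
        my ($key) = $line =~ /^(\d+)/;    # assumes a leading number
        push @recs, pack 'NN', $key, $off;
        $off = tell $ifh;
    }

    # Big-endian packed keys compare correctly as plain strings,
    # so the default (string) sort works and sorts only small scalars.
    @recs = sort @recs;

    # Pass 2: seek back to each line in key order and copy it out.
    open my $ofh, '>', $out or die "open $out: $!";
    for my $rec (@recs) {
        my ( undef, $o ) = unpack 'NN', $rec;
        seek $ifh, $o, 0 or die "seek: $!";
        print {$ofh} scalar <$ifh>;
    }
    close $ofh or die "close $out: $!";
}
```

The lines themselves get read twice (once to build the keys, once via seek to write the output), while the sort only churns the compact packed records.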
So, no, I don't follow the simplistic conclusion of "it doesn't fit into real memory, don't even bother".
So, an interesting consequence of these realizations: even though my solution uses more memory than your quite sound solution, by sorting something that takes less memory (an array of numbers, not an array of lines) it may require far fewer page faults. So using more memory can make things run faster when you don't have enough memory. But only if you use the memory wisely.
- tye
Re^11: Sorting a (very) large file (bigger and faster?)
by samtregar (Abbot) on Nov 30, 2007 at 22:22 UTC