Well, I'll try one more time, even though you don't seem to be paying much attention, IMHO.

Iterating just once over something that doesn't fit in physical memory takes only about as many page faults as it took to allocate and fill the big thing in the first place (and filling it takes about the same number of page faults whether or not it fits in memory; well, up to twice as many). So a single pass over something can take about twice as long when it doesn't fit in real memory. Not a big deal.
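A rough worked tally of that claim (P here is just the page count of the structure, an illustration rather than a measurement):

    faults to allocate and fill the structure:  ~P (up to ~2P)
    faults for one full pass over it:           ~P
    total:                                      ~2P to ~3P, vs ~P when it all fits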

Yes, a non-trivial "algorithm" run against something that doesn't fit into real memory will often cause a lot more page faults. sort is O(N*log(N)), so it visits most of the items being sorted about O(log(N)) times, which means a list that doesn't fit into memory can make the sort run about log(N) times slower than if the items did fit (assuming that faulting the list in once costs quite a bit more time than the per-item operations sort requires). That is often a huge difference. You needed 8,500,000 items to reproduce the sizes involved, so the slow-down can be proportional to log(8,500,000). (The "proportional" part means we can't use this theory to predict the multiplier, but it could certainly be quite large.)
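To put a rough number on that multiplier (base 2 is picked purely for illustration; the constant of proportionality is exactly what the theory doesn't give us):

    $ perl -le 'print log(8_500_000)/log(2)'    # prints roughly 23.02

So a slow-down on the order of 20-fold from page faulting alone would be entirely consistent with the big-O argument.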

My approach iterates over the list of lines just a couple of times. So the lines not fitting into real memory should make my code run only 2 to 4 times slower. Certainly that might be worth spending more effort to reduce, but it can also be a fairly minor concern. And it is certainly a huge win over the cost of sorting the too-big list directly, especially once you pile onto that too-big list all of the overhead of a ton of anonymous arrays.
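For concreteness, here is a minimal sketch of that style of approach. It is not the code from this thread: the file name, the 'N' pack format, and the assumptions that each line begins with an unsigned numeric key and that offsets fit in 32 bits are all mine.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = 'big.txt';    # hypothetical input file
    open my $fh, '<', $file or die "open $file: $!";

    # Pass 1: one sequential read; keep only a small fixed-size packed
    # string per line: the numeric key followed by the byte offset.
    my @keys;
    my $offset = 0;
    while ( my $line = <$fh> ) {
        if ( my ($num) = $line =~ /^(\d+)/ ) {
            push @keys, pack 'NN', $num, $offset;
        }
        $offset = tell $fh;
    }

    # 'N' is big-endian, so plain string comparison of the packed keys
    # matches numeric order: no comparison sub, no anonymous arrays.
    @keys = sort @keys;

    # Pass 2: revisit the lines in sorted order.
    for my $key (@keys) {
        my ( undef, $off ) = unpack 'NN', $key;
        seek $fh, $off, 0 or die "seek: $!";
        print scalar <$fh>;
    }

The point of the design is that the only thing ever held in memory and sorted is an array of 8-byte strings, touched a couple of times, rather than the full lines being shuffled around O(log(N)) times each.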

So, no, I don't follow the simplistic conclusion of "it doesn't fit into real memory, don't even bother".

So, an interesting consequence of these realizations: even though my solution will use more memory than your quite sound solution, by sorting something that takes less memory (an array of packed numbers rather than an array of lines) it may incur far fewer page faults. So using more memory can make things run faster when you don't have enough memory. But only if you use the memory wisely.
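A back-of-the-envelope comparison (the 100-byte average line length is my assumption, not a figure from the thread):

    $ perl -le 'printf "%.0f MB of lines vs %.0f MB of packed keys\n", 8.5e6*100/2**20, 8.5e6*8/2**20'
    811 MB of lines vs 65 MB of packed keys

The packed keys may fit comfortably in real memory even when the lines themselves cannot (perl's per-scalar overhead adds the same fixed cost per element to both sides, so it doesn't change the comparison much).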

- tye        

