It all has to do with the data volumes we are talking about, as I referred to in my original response.
A hash-table, or a list, is of course an “in-memory data structure.” And that means virtual memory, of course, which is really not quite “memory” at all, but rather a very souped-up, hardware-accelerated, over-glorified disk file. This distinction is not particularly important until data volumes become large ... at which time the issue of “page faults” begins to rear its ugly head.
Virtual memory subsystems have one rather strange characteristic: “they’re good, then they’re not quite so good, then (boom!) they’re awful.” There is absolutely nothing in between. If you plot a performance curve, you see a gradual, mostly linear progression, then (boom!) it “hits the wall” (the so-called thrash point), at which point the graph becomes exponential...
We are talkin’ Old Testament. I mean, real wrath of God stuff! 40 years of torment, earthquakes, tidal waves! The dead rising from the grave! World destruction, human sacrifices, Dogs and Cats Living Together....Mass Hysteria!!!
(Ahem...) I digress...
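If you want to see that knee for yourself, here is a tiny, purely illustrative Perl sketch (the hash sizes and probe count are arbitrary numbers I picked; on a box with plenty of free RAM you may never reach the thrash point at these sizes):

    #!/usr/bin/perl
    # Illustrative only: time random probes into ever-larger in-memory
    # hashes. On a machine that is short on free RAM, the larger sizes
    # should show the sudden "hits the wall" knee described above.
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    for my $n (1_000_000, 2_000_000, 4_000_000, 8_000_000) {
        my %h;
        $h{$_} = 1 for 1 .. $n;          # build the in-memory hash

        my $t0  = time;
        my $hit = 0;
        $hit += $h{ 1 + int rand $n } // 0 for 1 .. 100_000;
        printf "%8d keys: %.3f s for 100_000 random probes\n",
            $n, time - $t0;
    }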
Disk-based sorting, on the other hand, with any current algorithm, is basically O(n log n). Which means: “it’s never gonna be exponential, and that’s really all that matters.” No matter how many millions of records you have, it’s going to take a predictable (and reasonable, considering...) amount of time to sort them all. And, once you have finished sorting, it takes just one linear pass through the sorted file(s) to finish the entire job.
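For concreteness, here is a minimal sketch of the classic two-phase external merge sort being described: sort fixed-size chunks in memory, write each out as a sorted run, then make one linear merge pass. The chunk size, file handling, and helper names are my own illustration, not anything from the original:

    #!/usr/bin/perl
    # Sketch of an external merge sort: phase 1 writes sorted runs,
    # phase 2 merges them in a single linear pass.
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $CHUNK = 100_000;    # lines held in memory per run (arbitrary)

    # Phase 1 helper: sort one chunk in memory, write it to a temp file.
    sub write_run {
        my ($buf) = @_;
        my ($fh, $name) = tempfile(UNLINK => 0);
        print {$fh} $_ for sort @$buf;   # O(n log n), but n is small here
        close $fh;
        return $name;
    }

    # Phase 1: read the input in chunks, emitting one sorted run each.
    sub make_runs {
        my ($in_fh) = @_;
        my (@runs, @buf);
        while (my $line = <$in_fh>) {
            push @buf, $line;
            if (@buf >= $CHUNK) { push @runs, write_run(\@buf); @buf = (); }
        }
        push @runs, write_run(\@buf) if @buf;
        return @runs;
    }

    # Phase 2: one linear merge pass over all the runs.
    sub merge_runs {
        my ($out_fh, @runs) = @_;
        my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @runs;
        my @heads = map { scalar readline($_) } @fhs;
        while (grep { defined } @heads) {
            my $min;    # smallest current head (linear scan; a heap
                        # would scale better to very many runs)
            for my $i (0 .. $#heads) {
                next unless defined $heads[$i];
                $min = $i if !defined($min) || $heads[$i] lt $heads[$min];
            }
            print {$out_fh} $heads[$min];
            $heads[$min] = readline($fhs[$min]);
        }
        close $_ for @fhs;
        unlink @runs;
    }

    merge_runs(\*STDOUT, make_runs(\*STDIN));

Memory use is bounded by $CHUNK lines no matter how big the input gets, which is exactly why the curve never “hits the wall.”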
If you have a “modest” amount of data to deal with, this technique might make no difference, and it might even be slower. But if you have large amounts of data, it will make all the difference in the universe.
Well, that might be a bit of an exaggeration... but at least we won’t have Dogs and Cats Living Together. And that, surely, is a good thing ...
Memo to File: From now on, do not undertake to make comments on PerlMonks after having watched both installments of Ghostbusters back-to-back in a single evening ...
> A hash-table, or a list, is of course an “in-memory data structure.” And that means virtual memory, of course, which is really not quite “memory” at all, but rather a very souped-up, hardware-accelerated, over-glorified disk file.
I'm sorry, but that's incorrect. The size of your VM is, of course, your memory plus your swap - but that does not mean "in-memory data structure == disk file". If, and only if, your actual memory is all filled up do you start using swap - and even then it's a percentage game. Today, with memory on even a tiny laptop such as mine up in the GB range, this isn't a real issue even for data sets containing millions of records (although, depending on record size, you might be getting up there).
So... you're actually not talking about hash lookups vs. sorting - this is more of a "if your VM is small enough and your data is large enough to get to the thrash point of the VM" projection. OK, thanks. :) I was wondering if I'd somehow misunderstood or missed something really basic regarding hash lookups. I appreciate the sanity check. :)
--
Education is not the filling of a pail, but the lighting of a fire.
-- W. B. Yeats