If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer. That approach is: sorting.
Wait, you're saying if the data set is large, instead of using an algorithm with a linear expected running time, you opt for one that's Ω(N log N)?
(And that Ω(N log N) covers only the sort itself; the processing that follows is extra, though it may well be faster than the sort.)
The larger the dataset, the more the sorting dominates. Even if the sort is written as optimally as possible and runs in C, once the dataset is large enough, Ω(N log N) will take more time than a linear algorithm driven by turtles.
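A quick back-of-the-envelope sketch of that point; the constants (1 ns per step of the N log N sort, 30 ns per step of the linear "turtle" pass) are invented purely for illustration:

    use strict;
    use warnings;

    # Made-up constants: 1 ns per N*log2(N) "sort" step, 30 ns per step
    # of the linear "turtle" algorithm.  Only the growth rates matter.
    for my $exp (6, 8, 10) {
        my $n        = 10**$exp;
        my $sort_s   = 1  * $n * log($n) / log(2) / 1e9;
        my $turtle_s = 30 * $n / 1e9;
        printf "N=10^%-2d  sort: %7.1f s   linear: %7.1f s\n",
            $exp, $sort_s, $turtle_s;
    }

With these invented constants the crossover sits somewhere around N = 10^9; past that the linear algorithm wins, and the gap only widens from there.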
Not my kind of advice.
| [reply] |
When data sizes become large, the fact that “memory is virtual” becomes critically important. When a system begins to “thrash,” actual running times shoot up dramatically ... a phenomenon with an obvious “elbow” in the time curve which is not-so-affectionately known as “hitting the wall.” (Or, if you prefer, “the windshield.” As in, “one of those big ugly bugs with very colorful and juicy guts that are specially evolved to direct themselves, at the moment of their demise, directly and precisely at the middle of the driver’s side of the glass, about halfway up.” Or maybe a migrating swarm of them, and yours is the lucky automobile. Ugh.)
An algorithm based on external disk-sorting, then, will avoid that issue. The sort-program can be relied upon to use available memory efficiently, without overwhelming that resource when the file becomes very large. The programs that are subsequently written to use these known-to-be identically-sorted files are, as I said, “trivial.”
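To make “trivial” concrete, here is a minimal sketch of the kind of follow-on program meant here: a merge-join of two files that are assumed to be already sorted on the same (tab-separated) key field. The file names and record layout are made up purely for illustration.

    use strict;
    use warnings;

    # Join two files that are already sorted on their first tab-separated
    # field, reading each file exactly once, sequentially.
    open my $left,  '<', 'customers.sorted' or die "customers.sorted: $!";
    open my $right, '<', 'orders.sorted'    or die "orders.sorted: $!";

    my $l = <$left>;
    my $r = <$right>;
    while (defined $l and defined $r) {
        my ($lk) = split /\t/, $l;
        my ($rk) = split /\t/, $r;
        if    ($lk lt $rk) { $l = <$left>  }
        elsif ($lk gt $rk) { $r = <$right> }
        else {
            chomp(my $lrec = $l);
            chomp(my $rrec = $r);
            print "$lrec\t$rrec\n";
            $r = <$right>;   # one-to-many join; duplicate keys on the
                             # left side would need a little more care
        }
    }

Each input is read once, front to back, so memory use stays flat no matter how big the files get.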
It’s so important that there are companies (http://www.syncsort.com) that do nothing else but “sort huge things faster than anyone else knows how to do.” “When you need them, you will pay them.”
The same strategy also shines when you are otherwise dealing with very large random-access disk files. Each access to such a file might entail one or more “seeks,” and this purely-physical consideration can drive completion times into the dungeon. An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.
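To put rough numbers on that claim (the figures are made up, but typical for a spinning disk): a random seek costs on the order of 10 ms, while sequential throughput is on the order of 100 MB/s. Ten million random lookups would then burn roughly 10^5 seconds in seek time alone, whereas one sequential pass over a 10 GB file that has been sorted to match the access order takes on the order of 10^2 seconds ... which is where a factor of a thousand can come from.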
It won’t shine ... it may well be more costly ... when these resource-driven constraints do not exist, as is often the case. Is this, then, “a silver-bullet strategy?” No, it is something of a pain in the asterisk. Memory is so plentiful and cheap these days that it is often a non-issue (although this is also why a program that seems very successful in system-acceptance testing falls flat on its asterisk in production). This is a strategy that you need to be familiar with. “When you need it, you will know.”
| [reply] |
When my dataset is so large that it cannot be held in memory any more, I much prefer a linear algorithm using a disk-backed tied hash (for instance, using db files) to sorting such a huge set.
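For instance, a minimal sketch using DB_File to tie a hash to an on-disk Berkeley DB file; the file name and the counting task are just illustrative:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Tie a hash to an on-disk DB file, so the "hash" can grow well
    # beyond available RAM while each key still costs one lookup.
    tie my %count, 'DB_File', 'counts.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot open counts.db: $!";

    while (my $line = <STDIN>) {
        $count{$_}++ for split ' ', $line;   # one disk-backed update per key
    }

    untie %count;   # flush and close the database

One pass over the data, no sort, and the on-disk working set is managed by the DB library rather than by the virtual-memory system.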
An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.
Prove it. Where's your benchmark showing that the problem described by the OP runs hundreds or thousands of times faster when sorting instead of using hashes?
| [reply] |