in reply to Re: Find duplicate elements from an array
in thread Find duplicate elements from an array

    If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer. That approach is: sorting.
Wait, you're saying if the data set is large, instead of using an algorithm with a linear expected running time, you opt for one that's Ω(N log N)?
(Granted, the sort step itself might well be implemented faster than the rest of the processing.)
The larger the dataset, the more the sorting dominates. Even if the sort is written in the most optimized way possible, running in C, a sufficiently large dataset will make the Ω(N log N) approach take more time than a linear algorithm that's driven by turtles.
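For concreteness, here is a minimal sketch (with made-up sample data) of the two approaches being compared: a single pass over the data with a hash, versus sorting first and then scanning adjacent elements.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @data = qw(foo bar baz foo qux bar);    # made-up sample data

    # Hash-based: one pass, O(N) expected time.
    my %seen;
    my @dupes_hash = grep { $seen{$_}++ } @data;   # every repeat after the first

    # Sort-based: O(N log N) comparisons, then a linear scan of neighbours.
    my @sorted = sort @data;
    my @dupes_sort;
    for my $i (1 .. $#sorted) {
        push @dupes_sort, $sorted[$i]
            if $sorted[$i] eq $sorted[$i - 1]
            && (!@dupes_sort || $dupes_sort[-1] ne $sorted[$i]);
    }

    print "hash: @dupes_hash\n";    # hash: foo bar
    print "sort: @dupes_sort\n";    # sort: bar foo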

Not my kind of advice.

Re^3: Find duplicate elements from an array
by sundialsvc4 (Abbot) on Nov 22, 2010 at 14:30 UTC

    When data sizes become large, the fact that “memory is virtual” becomes extremely critical. When a system begins to “thrash,” actual running times rise dramatically ... a phenomenon with an obvious “elbow” in the time-curve which is not-so-affectionately known as “hitting the wall.” (Or, if you prefer, “the windshield.” As in, “one of those big ugly bugs with very colorful and juicy guts that are specially evolved to direct themselves, at the moment of their demise, directly and precisely at the middle of the driver’s side of the glass, about halfway up.” Or maybe a migrating swarm of them, and yours is the lucky automobile. Ugh.)

    An algorithm based on external disk-sorting, then, will avoid that issue. The sort program can be relied upon to use available memory efficiently, without overwhelming that resource when the file becomes very large. The programs that are subsequently written to use these known-to-be identically-sorted files are, as I said, “trivial.”
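    As a rough illustration of how trivial that follow-up program can be (the file names here are made up): once an external sort has produced a sorted file, duplicates fall out of a single sequential pass that needs only constant memory.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Assumes an external sort has already been run, for example:
        #   sort data.txt > data.sorted      (file names are hypothetical)
        open my $fh, '<', 'data.sorted' or die "Cannot open data.sorted: $!";

        my ($prev, $last_reported);
        while (my $line = <$fh>) {
            chomp $line;
            if (defined $prev && $line eq $prev
                && (!defined $last_reported || $line ne $last_reported)) {
                print "$line\n";             # each duplicated value reported once
                $last_reported = $line;
            }
            $prev = $line;
        }
        close $fh;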

    It’s so important that there are companies (http://www.syncsort.com) that do nothing else but “sort huge things faster than anyone else knows how to do.” “When you need them, you will pay them.”

    The same strategy also shines when you are otherwise dealing with very large random-access disk files. Each access to such a file might entail one or more “seeks,” and this purely-physical consideration can drive completion times into the dungeon. An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.

    It won’t shine ... it may well be more costly ... when these resource-driven constraints do not exist, as is often the case. Is this, then, “a silver-bullet strategy?” No, it is somewhat of a pain in the asterisk. Memory is so plentiful and cheap these days that it is often a non-issue (although this is also why a program that seems to be very successful in system-acceptance testing falls flat on its asterisk in production). This is a strategy that you need to be familiar with. “When you need it, you will know.”

      When my dataset is so large that it cannot be held in memory anymore, I very much prefer a linear algorithm using a disk-backed tied hash (for instance, using db files; a sketch follows below) instead of sorting such a huge set.
          An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.
      Prove it. Where's your benchmark showing that the problem described by the OP is solved hundreds or thousands of times faster by sorting instead of using hashes?
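      For what it is worth, here is a minimal sketch of that disk-backed approach, using a hash tied to a Berkeley DB file via DB_File; the database file name is made up, and records are read one per line from STDIN or the files named on the command line.

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DB_File;
          use Fcntl;

          # Tie the 'seen' hash to an on-disk Berkeley DB file, so the
          # working set is not limited by available RAM.
          my %seen;
          tie %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0644, $DB_HASH
              or die "Cannot tie seen.db: $!";

          while (my $line = <>) {                  # one record per line
              chomp $line;
              print "$line\n" if $seen{$line}++;   # report every repeat after the first
          }

          untie %seen;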

        Actually, I can.

        I made quite a lot of money in the not-too-distant past selling a database-repair utility that ran hundreds of times faster than its competition. And the reason why “its competition” ran so slowly was that they, in order to verify that a secondary index was intact, simply read each one of its entries and did an indexed lookup on the corresponding primary-key value to see if it actually matched. This was, of course, an entirely “sensible” approach, but it also turned out to be a fabulously inefficient one – something I recognized from years of teaching COBOL.

        The key algorithm that I came up with, along with a few choice optimizations, was to extract the key information from the secondary index and from the primary file into flat files, sort the two files, and compare them linearly. And the difference in time was “hundreds of times faster.” The entire speed difference was in disk I/O. (Importantly, RAM sizes were much smaller and hardware much slower then.)
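        A rough sketch of that kind of linear comparison, assuming two identically sorted key files with one key per line (the file names are hypothetical); it walks both files in step and reports keys that appear in one file but not the other.

            #!/usr/bin/perl
            use strict;
            use warnings;

            open my $a_fh, '<', 'primary_keys.sorted'   or die "primary: $!";
            open my $b_fh, '<', 'secondary_keys.sorted' or die "secondary: $!";

            my $a = <$a_fh>;
            my $b = <$b_fh>;
            while (defined $a && defined $b) {
                chomp(my $ka = $a);
                chomp(my $kb = $b);
                if    ($ka lt $kb) { print "only in primary:   $ka\n"; $a = <$a_fh>; }
                elsif ($ka gt $kb) { print "only in secondary: $kb\n"; $b = <$b_fh>; }
                else               { $a = <$a_fh>; $b = <$b_fh>; }   # keys agree
            }
            # Drain whichever file still has keys left over.
            while (defined $a) { chomp $a; print "only in primary:   $a\n"; $a = <$a_fh>; }
            while (defined $b) { chomp $b; print "only in secondary: $b\n"; $b = <$b_fh>; }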

        But a very, very similar thing happened years later when I worked with a regional retail chain. Their pricing files were being updated constantly ... and quite randomly. They lived and died by forecast models. Very simple things, including sorting the files that were going to be used as inputs against a hive of Berkeley-DB tied-hash files, made a very big difference. (As did separating the thus-sorted streams into groups that each computer within a cluster could process independently, knowing that what it was doing would never step on what any other computer was simultaneously doing. The “clustered” operations became truly parallel, I think for the first time in that system’s history.)

        Every time you take a page fault, it costs you an I/O, and that costs you milliseconds. (If you start “thrashing,” you are now beating yourself up as you steal vitally-needed pages from yourself.) Every time you bounce down through an index tree, except to the degree that the index pages are already cached, it once again costs you an I/O and milliseconds. By the billions or hundreds of billions, those “mere milliseconds” suddenly add up.

        Algorithms that are purely based on memory, or upon indexed files, can drop to their knees, roll over, and drop dead quite suddenly when they experience this “hitting the wall” phenomenon. This can have an extremely big impact upon production operations. “We only added a few hundred thousand records, but the jobs telescoped from taking minutes to run to taking many hours (that we don’t have!).” Although on-disk sorting can have its own slew of problems, especially as data volumes creep from the sublime to the ridiculous (as they sometimes do), it is something that I have observed being completely overlooked. I do not, however, mean to imply that this is some silver bullet. Hardware is always getting faster, memory is getting bigger, and SANs seem to come from another planet with intelligent life.