in reply to Re^4: Find duplicate elements from an array
in thread Find duplicate elements from an array

Actually, I can.

I made quite a lot of money in the not-too-distant past selling a database repair utility that ran hundreds of times faster than its competition. The reason the competition ran so slowly was that, in order to verify that a secondary index was intact, they simply read each one of its entries and did an indexed lookup on the corresponding primary-key value to see whether it actually matched. This was, of course, an entirely “sensible” approach, but it also turned out to be a fabulously inefficient one – something I realized from years of teaching COBOL.

The key algorithm I came up with, along with a few choice optimizations, was to extract the key information from both the index and the primary file into flat files, sort the two files, and compare them linearly. The difference in time was those “hundreds of times.” The entire speed difference was in disk I/O. (Importantly, both RAM sizes and hardware speeds were much smaller then.)
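The comparison step is nothing more exotic than the classic balanced-line merge. Here is a minimal sketch in Perl; the file names and one-key-per-line layout are made up for illustration, and both files are assumed to be already sorted:

    # Compare two sorted key files in one linear pass (balanced-line merge).
    # index_keys.txt and primary_keys.txt are hypothetical names; each holds
    # one key per line, pre-sorted (e.g. by the external sort utility).
    use strict;
    use warnings;

    open my $idx, '<', 'index_keys.txt'   or die "index_keys.txt: $!";
    open my $pri, '<', 'primary_keys.txt' or die "primary_keys.txt: $!";

    my $i = <$idx>;
    my $p = <$pri>;
    while ( defined $i && defined $p ) {
        chomp( my $ik = $i );
        chomp( my $pk = $p );
        if    ( $ik lt $pk ) { print "orphaned index entry: $ik\n"; $i = <$idx>; }
        elsif ( $ik gt $pk ) { print "missing index entry: $pk\n";  $p = <$pri>; }
        else                 { $i = <$idx>; $p = <$pri>; }    # matched pair
    }
    # Whatever remains in either file has no partner in the other.
    while ( defined $i ) { chomp $i; print "orphaned index entry: $i\n"; $i = <$idx>; }
    while ( defined $p ) { chomp $p; print "missing index entry: $p\n";  $p = <$pri>; }

Every record is touched exactly once, sequentially, which is exactly why the disk I/O collapses compared to one random indexed lookup per entry.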

But a very, very similar thing happened years later when I worked with a regional retail chain. Their pricing files were being updated constantly, and quite randomly, and they lived and died by forecast models. Very simple things, such as sorting the input files before running them against a hive of Berkeley-DB tied-hash files, made a very big difference. (As did splitting the thus-sorted streams into groups that each computer in the cluster could process independently, knowing that what it was doing would never step on what any other computer was simultaneously doing. The “clustered” operations became truly parallel, I think for the first time in that shop’s history.)
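The sorting trick itself is almost embarrassingly small. A sketch of the idea using DB_File with a BTREE database (so that key-sorted access walks the on-disk pages in order); the file and field names here are invented for illustration:

    # Apply a batch of price updates to a Berkeley DB file, sorting the
    # batch by key first so successive lookups land on nearby, often
    # already-cached, pages instead of seeking randomly all over the disk.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    tie my %prices, 'DB_File', 'prices.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "prices.db: $!";

    # updates.txt (hypothetical): lines of "sku<TAB>new_price", random order.
    open my $fh, '<', 'updates.txt' or die "updates.txt: $!";
    my @updates = map { chomp; [ split /\t/ ] } <$fh>;

    # The one-line difference: sort the batch by key before applying it.
    for my $u ( sort { $a->[0] cmp $b->[0] } @updates ) {
        $prices{ $u->[0] } = $u->[1];
    }

    untie %prices;

Nothing about the database changes; only the order of arrival does.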

Every time you take a page fault, it costs you an I/O, and that costs you milliseconds. (If you start “thrashing,” you are now beating yourself up, stealing vitally needed pages from yourself.) Every time you bounce down through an index tree, except to the degree that the index pages are already cached, it once again costs you an I/O and milliseconds. Multiplied by the billions or hundreds of billions, those “mere milliseconds” suddenly add up.
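To put rough, assumed numbers on it: at about 5 ms per random seek (typical of the spinning disks of that era), a billion random reads is 5,000,000 seconds, which is nearly two months of nothing but seek time. A billion 100-byte records streamed sequentially at 100 MB/s, on the other hand, go by in well under half an hour.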

Algorithms that are based purely on memory, or on indexed files, can drop to their knees, roll over, and die quite suddenly when they experience this “hitting the wall” phenomenon. That can have an extremely big impact on production operations: “We only added a few hundred thousand records, but the jobs telescoped from taking minutes to run to taking many hours (that we don’t have!).” Although on-disk sorting has its own slew of problems, especially as data volumes creep from the sublime to the ridiculous (as they sometimes do), it is something that I have seen being completely overlooked. I do not, however, mean to imply that it is some silver bullet. Hardware keeps getting faster, memory keeps getting bigger, and SANs seem to come from another planet with intelligent life.
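To bring it back to the question at hand: the same principle applies to finding duplicates once the “array” no longer fits comfortably in RAM. Instead of a %seen hash, hand the data to an on-disk sort and scan adjacent lines. A sketch, assuming a Unix-ish system where the list form of a pipe open works and a hypothetical elements.txt with one value per line:

    # Find duplicate elements without holding a %seen hash in RAM: let the
    # external sort(1) do the heavy lifting on disk, then a single linear
    # pass reports each duplicated value exactly once.
    use strict;
    use warnings;

    open my $sorted, '-|', 'sort', 'elements.txt'
        or die "cannot run sort: $!";

    my ( $prev, $last_reported );
    while ( my $line = <$sorted> ) {
        chomp $line;
        if (   defined $prev
            && $line eq $prev
            && ( !defined $last_reported || $line ne $last_reported ) )
        {
            print "$line\n";    # a duplicated value, reported once
            $last_reported = $line;
        }
        $prev = $line;
    }

For a few thousand elements the in-memory hash wins, hands down. The point is simply to know where the wall is before you hit it.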