in reply to Find duplicate elements from an array

When I see “match duplicate names,” and assuming a reasonable data-size, I immediately think of hash tables.   A hash whose key is the name, and whose value is an arrayref ... pointing to a list of page numbers (or, if you are also worried about duplicates there, another hashref).
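
A minimal sketch of that idea in Perl ... assuming, purely for illustration, tab-separated “name<TAB>page” records on standard input:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hash keyed by name; each value is an arrayref of page numbers.
    my %pages_for;

    while (my $line = <STDIN>) {
        chomp $line;
        my ($name, $page) = split /\t/, $line;    # assumed tab-separated input
        push @{ $pages_for{$name} }, $page;
    }

    # Any name that collected more than one page number is a duplicate.
    for my $name (sort keys %pages_for) {
        my @pages = @{ $pages_for{$name} };
        print "$name: pages @pages\n" if @pages > 1;
    }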

If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer.   That approach is:   sorting.   If you sort the data-file, first by name and then by page-number (an operation which can be done “unexpectedly fast,” even for huge files), the resulting file has the following useful characteristics:

  1. All occurrences of any particular key-value will be adjacent.   So, you can detect “gaps” without searching.
  2. Entirely-duplicate records are also adjacent, so you can detect them by comparing the current record to the preceding one (see the sketch after this list).
  3. If you need to compare, merge, diff, etc. two arbitrarily-large files, simply sort them both the same way and the problem becomes trivial.
  4. Most people want their output reports nicely sorted ... and presto, you’ve got that.
  5. No matter how much data you’ve got, it all takes a constant amount of memory and a linear amount of time.   (Excluding the sort-process itself, which might well be faster than the rest of it.)
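
To make point 2 concrete, here is a minimal sketch (not taken from the post above) that reads an already-sorted stream and flags entirely-duplicate records simply by comparing each line to the one before it; the sort itself is assumed to have been done beforehand, e.g. with the system sort utility:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumes the input is already sorted, e.g.:  sort data.txt | perl find_dups.pl
    my $previous;

    while (my $record = <STDIN>) {
        chomp $record;
        print "duplicate: $record\n"
            if defined $previous && $record eq $previous;   # identical records are adjacent
        $previous = $record;
    }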

Re^2: Find duplicate elements from an array
by JavaFan (Canon) on Nov 22, 2010 at 14:08 UTC
    If you have large data volumes, there is another excellent approach ... one that pre-dates the digital computer. That approach is: sorting.
    Wait, you're saying if the data set is large, instead of using an algorithm with a linear expected running time, you opt for one that's Ω(N log N)?
    (Excluding the sort-process itself, which might well be faster than the rest of it.)
    The larger the dataset is, the more dominant the sorting will be. Even if it's written in the most optimized way possible, running in C, if the dataset is large enough, Ω(N log N) will take more time than a linear algorithm that's driven by turtles.

    Not my kind of advice.

      When data sizes become large, the fact that “memory is virtual” becomes extremely critical.   When a system begins to “thrash,” actual running-times rise exponentially ... a phenomenon with an obvious “elbow” in the time-curve which is not-so-affectionately known as “hitting the wall.”   (Or, if you prefer, “the windshield.”   As in, “one of those big ugly bugs with very colorful and juicy guts that are specially evolved to direct themselves, at the moment of their demise, directly and precisely at the middle of the driver’s side of the glass, about halfway up.”   Or maybe a migrating swarm of them, and yours is the lucky automobile.   Ugh.)

      An algorithm based on external disk-sorting, then, will avoid that issue.   The sort-program can be relied upon to use available memory efficiently, without overwhelming that resource when the file becomes very large.   The programs that are subsequently written to use these known-to-be identically-sorted files are, as I said, “trivial.”
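
      As a hedged illustration of how “trivial” such programs become, here is a sketch that walks two identically-sorted files in a single pass and reports keys that appear in only one of them; the file names are placeholders:

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Classic "balanced line" comparison of two identically-sorted files.
          open my $old_fh, '<', 'old_sorted.txt' or die "old_sorted.txt: $!";
          open my $new_fh, '<', 'new_sorted.txt' or die "new_sorted.txt: $!";

          my $old = <$old_fh>;
          my $new = <$new_fh>;

          while (defined $old && defined $new) {
              if    ($old lt $new) { print "only in old: $old"; $old = <$old_fh>; }
              elsif ($old gt $new) { print "only in new: $new"; $new = <$new_fh>; }
              else                 { $old = <$old_fh>; $new = <$new_fh>; }   # key in both
          }
          while (defined $old) { print "only in old: $old"; $old = <$old_fh>; }
          while (defined $new) { print "only in new: $new"; $new = <$new_fh>; }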

      It’s so important that there are companies (http://www.syncsort.com) that do nothing else but “sort huge things faster than anyone else knows how to do.”   “When you need them, you will pay them.”

      The same strategy also shines when you are otherwise dealing with very large random-access disk files.   Each access to such a file might entail one or more “seeks,” and this purely-physical consideration can drive completion times into the dungeon.   An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.

      It won’t shine ... it may well be more costly ... when one of these resource-driven constraints does not exist, as is often the case.   Is this, then, “a silver-bullet strategy?”   No, it is somewhat of a pain in the asterisk.   Memory is so plentiful and cheap these days that it is often a non-issue (although this is also why a program that seems to be very successful in system-acceptance testing falls flat on its asterisk in production).   This is a strategy that you need to be familiar with.   “When you need it, you will know.”

        When my dataset is so large that it cannot be held in memory anymore, I very much prefer a linear algorithm using a disk-tied hash (for instance, using db files) instead of sorting such a huge set.
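
        A minimal sketch of that disk-tied-hash approach, using the DB_File module that ships with Perl; the database file name and the record layout are illustrative only:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use DB_File;

            # Tie the "seen" hash to an on-disk Berkeley DB file,
            # so it does not have to fit in memory.
            tie my %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
                or die "Cannot tie seen.db: $!";

            while (my $line = <STDIN>) {
                chomp $line;
                my ($name) = split /\t/, $line;    # assumed tab-separated records
                print "duplicate: $name\n" if $seen{$name}++;
            }

            untie %seen;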
        An algorithm based on sorting can operate hundreds, or even thousands(!) of times faster.
        Prove it. Where's your benchmark showing that the problem described by the OP is solved hundreds or thousands of times faster when sorting instead of using hashes?
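
        For what it's worth, here is a minimal sketch of how such a benchmark might be set up with Perl's core Benchmark module, comparing one linear hash pass against an in-memory sort followed by an adjacent-element scan; the data generator is purely illustrative, the in-memory sort is only a rough stand-in for the external disk sort being argued about, and no results are claimed here:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            # Purely illustrative data: random "names" with plenty of repeats.
            my @names = map { 'name' . int rand 50_000 } 1 .. 100_000;

            cmpthese(-3, {
                # One linear pass with an in-memory hash.
                hash => sub {
                    my %seen;
                    my @dups = grep { $seen{$_}++ } @names;
                },
                # Sort first, then compare adjacent elements.
                sort => sub {
                    my @sorted = sort @names;
                    my @dups   = map  { $sorted[$_] }
                                 grep { $sorted[$_] eq $sorted[$_ - 1] } 1 .. $#sorted;
                },
            });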