in reply to Computing Subsets
You asked if there might be a more efficient approach than using Perl's hashes. In a very specific case of a really small set, there may be. But in the general case, probably not.
While the common Perl idiom (and the one that is practically impossible to beat in the general case) is using a hash to represent sets, languages like C++ that (before C++11) don't have a native hash container are forced to work with other strategies. That makes them good examples for learning other techniques in the absence of a hash. I like seeing how something is usually accomplished in other languages -- often it brings me back to Perl's elegance. And just as often I find that I learn something about why Perl can make some solutions so elegant.
In C++ you have the set and the multiset containers. Each is typically implemented as a balanced binary tree (commonly a red-black tree), though there is no requirement that it be; it just works out well that way. C++ sets and multisets keep their elements sorted according to a "strict weak ordering", which pretty much means a sorted order based on how '<' is overloaded for the given data type (not really important for this discussion).
C++ sets and multisets have a time complexity for individual inserts of O(log n), so building the entire set will be O(n log n), which turns out to be just about the same time complexity you would get from taking an unordered array and sorting it. Finding one element is O(log n), but traversal over the complete set or multiset is O(n). The algorithm one would apply in your situation to a multiset (a set can never hold duplicates in the first place), or even to a sorted sequence container (array, vector, list, or deque), might be the adjacent_find function. It can walk the ordered container in O(n) time, returning each occurrence (when called repeatedly) of adjacent elements where the element on the left is not less than the one on the right -- i.e., adjacent members that aren't relationally distinct, which on sorted data means duplicates. Another alternative would be to build the superset into an ordered structure once, and then test each element of the unordered subset against it with an O(log n) lookup. That can end up being a good optimization.
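To put that adjacent-scan idea back into Perl terms (the word list here is invented for illustration): sort first, then make one pass comparing neighbors. The sort is the O(n log n) part; the scan itself is O(n).

    use strict;
    use warnings;

    # The adjacent-scan idea, expressed in Perl: sort the data (O(n log n)),
    # then walk it once comparing each element to its neighbor (O(n)).
    my @words  = qw(apple pear apple banana pear cherry);
    my @sorted = sort @words;

    my @dups;
    for my $i ( 1 .. $#sorted ) {
        # On sorted data any duplicate must sit next to its twin.
        push @dups, $sorted[$i]
            if $sorted[$i] eq $sorted[ $i - 1 ]
            && ( !@dups || $dups[-1] ne $sorted[$i] );
    }

    print "duplicates: @dups\n";    # duplicates: apple pear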
So to reiterate, the options in Standard C++ (pre-C++11) boil down to: building a set or multiset and rejecting (and annotating) duplicates as they arrive at insert time; sorting a sequence container and scanning it with adjacent_find; or building the superset into an ordered structure and testing each subset element against it.
The least expensive of those seems to be the first: building a set in O(n log n) time and rejecting (and annotating) duplicates as they turn up. That's just about as good as you can do in the general case if you eliminate hashing from the toolbox. One thing that is great about algorithmic analysis is that a bound that holds for an algorithm in one language holds for the same algorithm in any language. Building an ordered set, ordered multiset, or ordered list out of unordered data just can't be done, in the general comparison-based case, in less than O(n log n) time. And there's no faster way to find duplicates in an ordered list than O(n) time. Finding duplicates in an unordered list is much worse: O(n^2) by pairwise comparison, unless you sort or hash first.
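If you wanted to mimic that first option in Perl without a hash, a binary-search insert gives the same catch-the-duplicate-as-it-arrives behavior. This is only a sketch with made-up data: the comparisons are O(log n) per insert, but Perl's splice into an array is itself linear, so a true O(n log n) build would need a real tree structure underneath.

    use strict;
    use warnings;

    my @input = qw(pear apple banana apple cherry pear);

    my ( @ordered, @dups );
    for my $item (@input) {
        # Binary search for the insertion point: O(log n) comparisons.
        my ( $lo, $hi ) = ( 0, scalar @ordered );
        while ( $lo < $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            if ( $ordered[$mid] lt $item ) { $lo = $mid + 1 }
            else                           { $hi = $mid }
        }

        # Duplicate detected at insert time, like a C++ set rejecting it.
        if ( $lo < @ordered && $ordered[$lo] eq $item ) {
            push @dups, $item;
            next;
        }
        splice @ordered, $lo, 0, $item;    # the splice itself is O(n)
    }

    print "ordered: @ordered\n";    # apple banana cherry pear
    print "dups:    @dups\n";       # apple pear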
So the question is, is there a better algorithm? The answer is yes. A well-designed hashing algorithm can create an unordered set in O(n) amortized time, and can facilitate lookups in O(1) amortized time. C++ had to rely on Boost's unordered_set until C++11 came along and added std::unordered_set, with performance characteristics similar to Perl's hashes.
But Perl has hashes. Building a hash as a container for an unordered set consumes O(n) amortized time. Duplicates can be eliminated (and annotated) at creation time. If the hash is used as a HoA (hash of arrays), you get unordered multiset characteristics: creation time is still O(n), and searching for duplicates consumes O(n) time.
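A minimal sketch of both ideas, with invented data and variable names: a plain hash as a set that notices duplicates while it is being built, and a hash of arrays when you want to keep every occurrence.

    use strict;
    use warnings;

    my @items = qw(apple pear apple banana pear pear);

    # Hash as an unordered set: O(n) amortized to build,
    # duplicates are noticed (and can be annotated) at insert time.
    my ( %seen, @dups );
    for my $item (@items) {
        push @dups, $item if $seen{$item}++;
    }

    # Hash of arrays (HoA) as an unordered multiset: creation is still
    # O(n), and every occurrence is retained (here, by its position).
    my %multi;
    push @{ $multi{ $items[$_] } }, $_ for 0 .. $#items;

    print "dups: @dups\n";                              # dups: apple pear pear
    print "pear seen at indexes: @{ $multi{pear} }\n";  # 1 4 5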
So while it might be more efficient in a specific case to just call grep a bunch of times on a very small list, or to flex the muscles of Perl's regular expressions in some way, the underlying algorithm won't be as efficient in the general case, and certainly isn't going to scale as well as a hash-based approach.
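For comparison, here is roughly what the repeated-grep approach looks like (the lists are made up for illustration). Every subset element triggers a full scan of the superset, which is where the poor scaling comes from.

    use strict;
    use warnings;

    my @superset = qw(apple banana cherry pear plum);
    my @subset   = qw(banana plum);

    # Each subset element forces a full grep over the superset: O(n * m),
    # versus O(n + m) for building a hash of the superset and probing it.
    my @missing = grep {
        my $want = $_;
        !grep { $_ eq $want } @superset;
    } @subset;

    print @missing ? "not a subset (missing: @missing)\n" : "subset\n";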
Your specific question was this:
I was wondering if there's a way to do it more efficiently via a clever regex (say by comparing two strings where the words are sorted and joined by a known delimiter).
Your clever substitute for a hash requires sorting the two lists to prepare them for the comparison. The sort is going to consume O(n log n) time, and that right there is the problem. Once the lists are sorted, there are a number of ways to test them in linear (O(n)) time. The machinery you use to carry out that linear pass (eq, index, or a regex) isn't really all that important, because you're already paying for the O(n log n) sort... unless your data set happens to be sorted already.
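For what it's worth, here is a sketch of the approach as I read it, with a made-up delimiter and data. It works, but the sorts dominate the cost before the string comparison (or any regex) ever runs.

    use strict;
    use warnings;

    my @list_a = qw(pear apple banana);
    my @list_b = qw(banana pear apple);

    # Sort each list (O(n log n)), join on a delimiter assumed never to
    # appear in the data, and compare the resulting strings (O(n)).
    my $delim = "\0";
    my $key_a = join $delim, sort @list_a;
    my $key_b = join $delim, sort @list_b;

    print $key_a eq $key_b ? "same set\n" : "different\n";    # same set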
Dave