grandpascorpion has asked for the wisdom of the Perl Monks concerning the following question:
Given colon-delimited strings, is there a more efficient way than a hash to test whether every element of one string appears in another? For example:

```perl
my $a = "AB:APPLE:BOY:ZOO:ZOOM";
my $b = "APPLE:ZOO";
my $c = "BOY:GIRL";

isSubset($b, $a);   # returns 1 -- $b is a "subset" of $a
isSubset($c, $a);   # returns 0 -- $c is not a "subset" of $a
```

Thanks for your insight.
Re: Computing Subsets
by davido (Cardinal) on May 27, 2012 at 22:55 UTC
You asked if there might be a more efficient approach than using Perl's hashes. In a very specific case with a really small set there may be, but in the general case, probably not. While the common Perl idiom (and the one that is practically impossible to beat in the general case) is to use a hash to represent a set, languages like C++ that (before C++11) had no native hash container are forced to use other strategies. That makes them good examples for learning other techniques in the absence of a hash. I like seeing how something is usually accomplished in other languages -- often it brings me back to Perl's elegance, and just as often I learn something about why Perl can make some solutions so elegant.

In C++ you have the set and the multiset containers. Each is usually implemented as a binary tree, though there is no requirement that it be; it just works out well that way. C++ sets and multisets maintain a "strict weak ordering", which pretty much means a sorted order for the elements based on how '<' is overloaded for the given data type (not really important for this discussion). They have a time complexity of O(log n) for individual inserts, so building the entire set is O(n log n) -- just about the same time complexity one would get from taking an unordered array and sorting it. Finding one element is O(log n), but traversal over the complete set or multiset is O(n).

The algorithm one would apply in your situation to a set, multiset, or even a sorted sequence container (array, vector, list, or deque) might be the adjacent_find function. It can walk the ordered container in O(n) time, returning each occurrence of adjacent elements where the element on the left is not less than the one on the right -- i.e., finding adjacent members that aren't relationally different. Another alternative would be to organize the superset and then test the unordered subset elements against it; that would end up being a good optimization.

So, to reiterate the options in Standard C++ (pre-C++11):

- Build a set or multiset in O(n log n) time, rejecting (and annotating) duplicates as they are inserted.
- Sort a sequence container in O(n log n) time, then scan it with adjacent_find in O(n) time.
- Sort only the superset, then test each unordered subset element with an O(log n) lookup.

The least expensive of those seems to be the first: building a set in O(n log n) time and rejecting (and annotating) duplicates as they turn up. That's just about as good as you can do in the general case if you eliminate hashing from the toolbox.

One thing that is great about algorithms in general is that a rule that applies to one language for a given algorithm applies to any language for that same algorithm. Building an ordered set, ordered multiset, or ordered list out of unordered data just can't be done in the general case in less than O(n log n) time, and there's no faster way to find duplicates in an ordered list than O(n) time. Finding duplicates in an unordered list is much worse.

So the question is: is there a better algorithm? The answer is yes. A well-designed hashing algorithm can create an unordered set in O(n) amortized time and facilitate lookups in O(1) amortized time. C++ had to rely on Boost until C++11 came along to provide unordered sets with performance characteristics similar to Perl's hashes. But Perl has hashes. Building a hash as a container for an unordered set consumes O(n) time, and duplicates can be eliminated (and annotated) at creation time. If the hash is used as a HoA, you get unordered multiset characteristics where creation time is still O(n) and searching for duplicates consumes O(n) time.
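A minimal Perl sketch of that last idea, with a hash of counts standing in for the HoA (the sub name is mine):

```perl
use strict;
use warnings;

# Build an unordered "multiset" from a colon-delimited string.
# Creation is O(n); any count greater than 1 annotates a duplicate.
sub build_multiset {
    my ($string) = @_;
    my %count;
    $count{$_}++ for split /:/, $string;
    return \%count;
}

my $set  = build_multiset("AB:APPLE:BOY:ZOO:ZOO:ZOOM");
my @dups = grep { $set->{$_} > 1 } keys %$set;   # O(n) duplicate scan
print "duplicated: @dups\n";                     # duplicated: ZOO
```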
So while it might be more efficient in a specific case to just call grep a bunch of times on a very small list, or to flex the muscles of Perl's regular expressions in some way, the underlying algorithm won't be as efficient in the general case, and certainly won't scale as well as a hash-based approach. Which brings us back to your specific question.
Your clever substitute for a hash requires sorting the two lists to prepare them for the test. The sort is going to consume O(n log n) time, and that right there is the problem. Once they're sorted, there are a number of ways to test for duplicates in linear (O(n)) time. The machine you use to carry out the linear search isn't really all that important, though, because you're suffering from the O(n log n) sort to begin with... unless your data set is already sorted.

Dave
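For illustration, one such linear test (a sketch of the idea, not the OP's actual code) is a merge-style pass over the two sorted lists:

```perl
use strict;
use warnings;

# O(n + m) subset test over two lists that are ALREADY sorted:
# walk a cursor through the big list, never backing up.
sub sorted_subset {
    my ($small, $big) = @_;
    my $i = 0;
    for my $want (@$small) {
        $i++ while $i < @$big && $big->[$i] lt $want;
        return 0 unless $i < @$big && $big->[$i] eq $want;
    }
    return 1;
}

my @big   = sort split /:/, "AB:APPLE:BOY:ZOO:ZOOM";   # the O(n log n) cost
my @small = sort split /:/, "APPLE:ZOO";
print sorted_subset( \@small, \@big ), "\n";           # 1
```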
Re: Computing Subsets
by Kenosis (Priest) on May 28, 2012 at 00:27 UTC
Here's another option using smart matching:
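A sketch of what such a grep-plus-smart-match test might look like (the sub name is invented here):

```perl
use strict;
use warnings;

# Smart match (~~) worked fine on 2012-era perls; it was later marked
# experimental (5.18) and eventually removed, so this is period code.
sub isSubsetSmart {
    my ($subset, $set) = @_;
    my @big   = split /:/, $set;
    my @small = split /:/, $subset;
    # grep builds a list of every @small element found in @big;
    # it's a subset when that list is as long as @small itself.
    return @small == grep { $_ ~~ @big } @small;
}

print isSubsetSmart( "APPLE:ZOO", "AB:APPLE:BOY:ZOO:ZOOM" ) ? 1 : 0;   # 1
```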
Update: The subroutine below might be more efficient, since it's not building a list (the results are the same):
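A sketch of that shape, reassembled around the one line quoted in the follow-up below (`do { $count-- when @big } for @small;`):

```perl
use strict;
use warnings;
use feature ':5.10';   # 'when' requires a 5.10+ feature bundle

# Instead of building a list with grep, decrement a counter every time
# a subset element smart matches the superset; zero left means subset.
sub isSubsetSmarter {
    my ($subset, $set) = @_;
    my @big   = split /:/, $set;
    my @small = split /:/, $subset;
    my $count = @small;
    do { $count-- when @big } for @small;
    return $count == 0;
}
```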
Here's an interesting read: How fast is Perl's smart match operator for searching scalar in an array?. Second update: Ran a benchmark on three subroutines having different set/subset algorithms:
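(A sketch of how such a comparison is typically set up with the core Benchmark module; the sub names and data sizes here are stand-ins, and the subroutines being compared must be defined in the same file:)

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

my $big   = join ':', map { "ITEM$_" } 1 .. 15;   # invented test data
my $small = join ':', map { "ITEM$_" } 1 .. 6;

cmpthese( -2, {    # run each candidate for at least 2 CPU seconds
    isSubset        => sub { isSubset( $small, $big ) },
    isSubsetSmart   => sub { isSubsetSmart( $small, $big ) },
    isSubsetSmarter => sub { isSubsetSmarter( $small, $big ) },
} );
```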
Results: the original isSubset is the slowest, but isSubsetSmarter is surprisingly the fastest of the crew -- at least with these small data sets.
by davido (Cardinal) on May 28, 2012 at 06:05 UTC
Re-run your benchmark after making a few changes:

- Remove say from the hash technique: it adds senseless IO operations that the other two techniques aren't being charged for.
- Rewrite the hash technique so that it internalizes the loop used to build the hash, and take advantage of the fact that an exists check is cheaper than a value check (a sketch follows below).
- Increase the data-set size, to show the trend as you scale the different solutions up.

It's unlikely that the OP has a data set so small as only five elements in the haystack and two in the needles set. In fact, it's unlikely that it's as small as 15 and 6. But it doesn't take much upscaling for the results of your benchmark to flip dramatically.

Dave
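A sketch of what that rewritten hash technique might look like (the sub name is a placeholder):

```perl
use strict;
use warnings;

# The lookup hash is built inside the sub, so the benchmark pays for
# it on every call, and membership uses exists, which is cheaper than
# fetching and testing the value.
sub isSubsetHash {
    my ($subset, $set) = @_;
    my %in_set = map { $_ => 1 } split /:/, $set;
    for ( split /:/, $subset ) {
        return 0 unless exists $in_set{$_};
    }
    return 1;
}
```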
by Kenosis (Priest) on May 28, 2012 at 06:41 UTC
Thank you, Dave, for catching the (inadvertently placed and) disruptive say in the sub wrapper. I didn't mean for it to be in there, and it certainly would throw off the results. It's been a long day here... Yes, the results do change with a larger data set and a better hash algorithm.
Interestingly, if I change `do { $count-- when @big } for @small;` to `do { $count-- if $_ ~~ @big } for @small;` in isSubsetSmarter, the results shift again.
It now becomes apparent, however, that using a hash improves performance for larger data sets in these cases.
by davido (Cardinal) on May 28, 2012 at 06:56 UTC
Re: Computing Subsets
by tobyink (Canon) on May 27, 2012 at 21:52 UTC
This seems to work...
... but I'd generally go with an array/hash solution.
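One common shape for that kind of array/hash solution (a sketch assuming the OP's colon-delimited input, not tobyink's posted code): a hash slice turns the superset into a constant-time lookup table.

```perl
use strict;
use warnings;

sub isSubset {
    my ($subset, $set) = @_;
    my %seen;
    @seen{ split /:/, $set } = ();   # hash slice: keys only, values undef
    my @small = split /:/, $subset;
    return @small == grep { exists $seen{$_} } @small;
}

print isSubset( "APPLE:ZOO", "AB:APPLE:BOY:ZOO:ZOOM" ) ? 1 : 0;   # 1
print isSubset( "BOY:GIRL",  "AB:APPLE:BOY:ZOO:ZOOM" ) ? 1 : 0;   # 0
```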
perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
by grandpascorpion (Initiate) on May 27, 2012 at 22:03 UTC
Re: Computing Subsets
by LanX (Saint) on May 28, 2012 at 17:57 UTC
See also Using hashes for set operations... Cheers, Rolf