grandpascorpion has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm interested in finding a quick way to see if a set of words is a subset of another set of words. Doing this with arrays or hashes is simple. I was wondering if there's a way to do it more efficiently via a clever regex (say, by comparing two strings where the words are sorted and joined by a known delimiter). For instance:
my $a = "AB:APPLE:BOY:ZOO:ZOOM";
my $b = "APPLE:ZOO";
my $c = "BOY:GIRL";

isSubset($b, $a)   # returns 1 -- $b is a "subset" of $a.
isSubset($c, $a)   # returns 0 -- $c is not a "subset" of $a.
Thanks for your insight

Re: Computing Subsets
by davido (Cardinal) on May 27, 2012 at 22:55 UTC

    You asked if there might be a more efficient approach than using Perl's hashes. In a very specific case of a really small set, there may be. But in the general case, probably not.

    While the common Perl idiom (and the one that is practically impossible to beat in the general case) is using a hash to represent sets, languages like C++, which (before C++11) had no native hash container, are forced to work with other strategies. That makes them good examples for learning other techniques in the absence of a hash. I like seeing how something is usually accomplished in other languages -- often it brings me back to Perl's elegance. And just as often I find that I learn something about why Perl can make some solutions so elegant.

    In C++ you have the set and the multiset containers. Each is often implemented as a binary tree, though there is no requirement that it be; it just works out well that way. C++ sets and multisets maintain a "strict weak ordering", which pretty much means a sorted order for the elements based on how '<' is overloaded for the given data type (not really important for the discussion).

    C++ sets and multisets have a time complexity for individual inserts of O(log n), so building the entire set will be O(n log n), which turns out to be just about the same time complexity that one would get from taking an unordered array and sorting it. Finding one element is O(log n), but traversal over the complete set or multiset is O(n). The algorithm one would apply in your situation to a multiset, or even to a sorted sequence container (array, vector, list, or deque), might be the adjacent_find function. It can walk the ordered set or ordered list in O(n) time, returning each occurrence of adjacent elements where the element on the left is not less than the one on the right -- i.e., finding adjacent members that aren't relationally different. Another alternative would be to organize only the superset, and then test each of the unordered subset's elements against it. That would end up being a good optimization.

    So to reiterate the options in Standard C++ (pre-C++11):

    • Build a set in O(n log n) time rejecting (and taking note of) duplicates as they occur.
    • Build a big multi-set in O(n log n) time, and then consume O(n) time applying the adjacent_find algorithm.
    • Sort a sequence container in O(n log n) time and then consume O(n) time applying the adjacent_find algorithm.

    The least expensive of those seems to be the first: building a set in O(n log n) time and rejecting (and annotating) duplicates as they turn up. That's just about as good as you can do in the general case if you eliminate hashing from the toolbox. One thing that is great about algorithms in general is that a rule that applies to one language for a given algorithm applies to any language for that same algorithm. Building an ordered set, ordered multiset, or ordered list out of unordered data just can't be done in the general case in less than O(n log n) time. And there's no faster way to find duplicates on an ordered list than O(n) time. Finding duplicates in an unordered list is much worse -- O(n^2) by pairwise comparison.
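
    The sort-then-adjacent-scan option translates directly to Perl. Here is a minimal sketch (the sub name is illustrative, and it assumes neither input list contains internal duplicates):

    sub isSubsetBySort {
        my @small  = split ':', $_[0];
        my @merged = sort @small, split( ':', $_[1] );    # O(n log n)

        # Identical strings sort adjacently, so one O(n) pass over the
        # merged list records every element that occurs more than once.
        my %dup;
        for my $i ( 1 .. $#merged ) {
            $dup{ $merged[$i] } = 1 if $merged[$i] eq $merged[ $i - 1 ];
        }

        # $_[0] is a subset of $_[1] iff every element of @small was
        # duplicated by an element from the superset.
        return 0 == grep { !$dup{$_} } @small;
    }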

    So the question is, is there a better algorithm? The answer is yes. A well-designed hashing algorithm can create an unordered set in O(n) amortized time, and can facilitate lookups in O(1) amortized time. C++ had to rely on Boost until C++11 came along to provide unordered sets with performance characteristics similar to Perl's hashes.

    But Perl has hashes. Building a hash as a container for an unordered set consumes O(n) time. Duplicates can be eliminated (and annotated) at creation time. If the hash is used as a HoA (hash of arrays), you can have unordered multiset characteristics, where creation time is still O(n) and searching for duplicates consumes O(n) time.
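
    In Perl, that approach reduces to a few lines. A minimal sketch of the hash-as-set subset test (the sub name is illustrative):

    sub isSubsetViaHash {
        my ( $small, $big ) = @_;

        my %set;
        $set{$_} = 1 for split ':', $big;    # O(n) set construction

        for my $word ( split ':', $small ) {
            return 0 unless exists $set{$word};    # O(1) amortized lookup
        }
        return 1;
    }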

    So while it might be more efficient in a specific case to just call grep a bunch of times on a very small list, or to flex the muscles of Perl's regular expressions in some way, the underlying algorithm won't be as efficient in the general case, and is certainly not going to scale as well as using a hash-based approach.
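
    For contrast, the grep-a-bunch-of-times idea looks like this -- a minimal sketch (sub name illustrative) that is fine for tiny lists but O(n*m) in general:

    sub isSubsetGrep {
        my @small = split ':', $_[0];
        my @big   = split ':', $_[1];

        for my $word (@small) {
            # one full linear scan of @big per element of @small
            return 0 unless grep { $_ eq $word } @big;
        }
        return 1;
    }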

    Your specific question was this:

    I was wondering if there's a way to do it more efficiently via a clever regex (say, by comparing two strings where the words are sorted and joined by a known delimiter).

    Your clever substitute for a hash requires sorting the two lists to prepare them for the test. The sort is going to consume O(n log n) time. That right there is the problem. Once they're sorted, there are a number of ways to test for duplicates in linear (O(n)) time. The machine you use to carry out the linear search isn't really all that important, though, because you're suffering from the O(n log n) sort to begin with... unless your data set is already sorted.
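
    If the data did arrive pre-sorted, the linear test needs no regex machinery at all. A minimal sketch of a two-pointer merge walk over the two split-out lists (the sub name is illustrative, and both inputs are assumed pre-sorted):

    sub isSubsetSorted {
        my @small = split ':', $_[0];
        my @big   = split ':', $_[1];

        my $i = 0;
        for my $word (@small) {
            $i++ while $i < @big && $big[$i] lt $word;      # advance in @big
            return 0 if $i >= @big || $big[$i] ne $word;    # $word not found
        }
        return 1;    # every element of @small was matched, in order
    }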


    Dave

Re: Computing Subsets
by Kenosis (Priest) on May 28, 2012 at 00:27 UTC

    Here's another option using smart matching:

    use Modern::Perl;

    my $a = "AB:APPLE:BOY:ZOO:ZOOM";
    my $b = "APPLE:ZOO";
    my $c = "BOY:GIRL";

    say isSubset( $b, $a );     # Returns 1
    say !isSubset( $c, $a );    # Returns 1

    sub isSubset {
        my @small = split ':', $_[0];
        my @big   = split ':', $_[1];
        @small ~~ [ grep $_ ~~ @big, @small ];
    }

    Update: The subroutine below might be more efficient, since it's not building a list (the results are the same):

    sub isSubset {
        my $count = my @small = split ':', $_[0];
        my @big   = split ':', $_[1];
        do { $count-- when @big } for @small;
        !$count;
    }

    Here's an interesting read: How fast is Perl's smart match operator for searching scalar in an array?.

    Update 2: Ran a benchmark on three subroutines having different set/subset algorithms:

    use Modern::Perl;
    use Benchmark qw(cmpthese);

    my $a = "AB:APPLE:BOY:ZOO:ZOOM";
    my $b = "APPLE:ZOO";
    my $c = "BOY:GIRL";

    sub isSubset {
        my @small = split ':', $_[0];
        my @big   = split ':', $_[1];
        @small ~~ [ grep $_ ~~ @big, @small ];
    }

    sub isSubsetSmarter {
        my $count = my @small = split ':', $_[0];
        my @big   = split ':', $_[1];
        do { $count-- when @big } for @small;
        !$count;
    }

    sub isSubsetHash {
        my $count = my @small = split ':', $_[0];
        my %big;
        $big{$_} = 1 for split ':', $_[1];
        do { $count-- if $big{$_} } for @small;
        !$count;
    }

    cmpthese( -5, {
        isSubset        => sub { isSubset( $b, $a ) },
        isSubsetSmarter => sub { isSubsetSmarter( $b, $a ) },
        isSubsetHash    => sub { say isSubsetHash( $b, $a ) },
    } );

    Results:

                        Rate isSubset isSubsetHash isSubsetSmarter
    isSubset        143452/s       --         -15%            -54%
    isSubsetHash    169263/s      18%           --            -45%
    isSubsetSmarter 309705/s     116%          83%              --

    The original isSubset is the slowest, but isSubsetSmarter is surprisingly the fastest of the crew -- at least with these small data sets.

      Re-run your benchmark after making a few changes:

      • Remove the say from your isSubsetHash sub wrapper. Scrolling a whole bunch of 1's up the terminal during the benchmark is guaranteed to throw your results way off.
      • Rewrite your isSubsetHash to look like this:
        sub isSubsetHash {
            my $count = my @small = split ':', $_[0];
            my %big;
            @big{ split( ':', $_[1] ) } = ();
            exists( $big{$_} ) && $count-- for @small;
            !$count;
        }
      • Increase the size of your data set from two needles in a haystack of five, to six needles in a haystack of fifteen.

      Removing say from the hash technique removes senseless IO operations that the other two techniques aren't testing.

      The rewrite of the hash technique internalizes the loop used to build the hash, and takes advantage of the fact that an exists check is cheaper than a value check.
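
      A side note on that rewrite: because the hash slice is assigned the empty list, every value in %big is undef, so exists is not merely cheaper here -- a plain truth test like $big{$_} would always be false. A minimal sketch of the difference:

      my %big;
      @big{ split ':', 'AB:APPLE:BOY' } = ();    # values are all undef

      print exists $big{APPLE} ? "present\n" : "absent\n";   # prints "present"
      print $big{APPLE} ? "true\n" : "false\n";              # prints "false": the value is undef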

      The increase of the data set size shows what the trend is as you try to upscale the different solutions. It's unlikely that the OP has a data set so small as only five elements in the haystack, and two in the needles set. In fact, it's unlikely that it's as small as 15 and 6. But it doesn't take much upscaling for the results of your benchmark to flip dramatically.


      Dave

        Thank you, Dave, for seeing the (inadvertently placed and) disruptive say in the sub wrapper. I didn't mean for it to be in there, and it certainly would throw off the results. It's been a long day here...

        Yes, the results do change with a larger data set and a better hash algorithm:

                            Rate isSubset isSubsetSmarter isSubsetHash
        isSubset         59189/s       --            -48%         -54%
        isSubsetSmarter 112912/s      91%              --         -13%
        isSubsetHash    129049/s     118%             14%           --

        Interestingly, if I change do { $count-- when @big } for @small; to do { $count-- if $_ ~~ @big } for @small; in isSubsetSmarter, the following results are produced:

                            Rate isSubset isSubsetSmarter isSubsetHash
        isSubset         59390/s       --            -50%         -54%
        isSubsetSmarter 118219/s      99%              --          -9%
        isSubsetHash    129268/s     118%              9%           --

        It becomes apparent now, however, that using a hash for larger data sets in these cases improves performance.

Re: Computing Subsets
by tobyink (Canon) on May 27, 2012 at 21:52 UTC

    This seems to work...

    use Test::More tests => 3;

    my $x = "AB:APPLE:BOY:ZOO:ZOOM";

    ok isSubset('APPLE:ZOO' => $x);
    ok !isSubset('BOY:GIRL' => $x);
    ok isSubset('AB:BOY:ZOOM' => $x);

    sub isSubset {
        my ($small, $big) = @_;

        # Turn each delimiter in $small into a pattern that tolerates any
        # number of intervening words (assumes the words themselves
        # contain no regex metacharacters).
        $small =~ s{:}{:(?:[^:]+:)*}xg;

        # Anchoring at both ends keeps the last word of $small from
        # matching a mere prefix of a word in $big (e.g. 'ZO' vs 'ZOO').
        return( $big =~ /^(?:[^:]+:)*$small(?::[^:]+)*$/ );
    }

    ... but I'd generally go with an array/hash solution.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      Thanks Toby! I'll give this a whirl.
Re: Computing Subsets
by LanX (Saint) on May 28, 2012 at 17:57 UTC
    Hash slices can be very efficient for sets of (non-empty) strings, so I'm not sure why you need regexes for this...

    DB<124> @s1{a..c}=a..c
    => ("a", "b", "c")
    DB<125> @s2{c..e}=c..e
    => ("c", "d", "e")
    DB<126> %cut=%s1
    => ("c", "c", "a", "a", "b", "b")
    DB<127> delete @s1{ keys %s2}
    => (undef, "c", undef)
    DB<128> delete @cut{ keys %s1}
    => ("a", "b")
    DB<129> keys %cut
    => "c"
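
    Applied to the OP's strings, the same delete-a-slice idea gives a compact subset test; a minimal sketch (sub name illustrative):

    sub isSubset {
        my %small;
        @small{ split ':', $_[0] } = ();      # candidate subset as hash keys
        delete @small{ split ':', $_[1] };    # drop everything the superset covers
        return !%small;                       # subset iff nothing is left
    }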

    see also Using hashes for set operations...

    Cheers Rolf