Re: Computing Subsets

Here's another option using smart matching:

use Modern::Perl;

my $a = "AB:APPLE:BOY:ZOO:ZOOM";
my $b = "APPLE:ZOO";
my $c = "BOY:GIRL";

say isSubset( $b, $a );    #Returns 1
say !isSubset( $c, $a );   #Returns 1

sub isSubset {
    my @small = split ':', $_[0];
    my @big   = split ':', $_[1];
    @small ~~ [ grep $_ ~~ @big, @small ];
}
[download]

Update: The subroutine below might be more efficient, since it's not building a list (the results are the same):

sub isSubset {
    my $count = my @small = split ':', $_[0];
    my @big   = split ':', $_[1];
    do { $count-- when @big } for @small;
    !$count;
}
[download]

Here's an interesting read: How fast is Perl's smart match operator for searching scalar in an array?.

UpdateUpdate: Ran a benchmark on three subroutines having different set/subset algorithms:

use Modern::Perl;
use Benchmark qw(cmpthese);

my $a = "AB:APPLE:BOY:ZOO:ZOOM";
my $b = "APPLE:ZOO";
my $c = "BOY:GIRL";

sub isSubset {
    my @small = split ':', $_[0];
    my @big   = split ':', $_[1];
    @small ~~ [ grep $_ ~~ @big, @small ];
}

sub isSubsetSmarter {
    my $count = my @small = split ':', $_[0];
    my @big = split ':', $_[1];
    do { $count-- when @big } for @small;
    !$count;
}

sub isSubsetHash {
    my $count = my @small = split ':', $_[0];
    my %big;
    $big{$_} = 1 for split ':', $_[1];
    do { $count-- if $big{$_} } for @small;
    !$count;
}

cmpthese(
    -5,
    {
        isSubset        => sub { isSubset( $b,         $a ) },
        isSubsetSmarter => sub { isSubsetSmarter( $b,  $a ) },
        isSubsetHash    => sub { say isSubsetHash( $b, $a ) },
    }
);
[download]

Results:

                    Rate      isSubset    isSubsetHash isSubsetSmarter
isSubset        143452/s            --            -15%            -54%
isSubsetHash    169263/s           18%              --            -45%
isSubsetSmarter 309705/s          116%             83%              --
[download]

The original isSubset is the slowest, but isSubsetSmarter is surprisingly the fastest of the crew--at least with these small data sets.

Comment on Re: Computing Subsets Select or Download Code

Replies are listed 'Best First'.
Re^2: Computing Subsets by davido (Cardinal) on May 28, 2012 at 06:05 UTC
Re-run your benchmark after making a few changes: Remove the say from your `isSubsetHash` sub wrapper. Scrolling a whole bunch of 1's up the terminal during the benchmark is guaranteed to throw your results way off. Rewrite your `isSubsetHash` to look like this: `sub isSubsetHash { my $count = my @small = split ':', $_[0]; my %big; @big{ split( ':', $_[1] ) } = (); exists( $big{$_} ) && $count-- for @small; !$count; }` [download] Increase the size of your data set from two needles in a haystack of five, to six needles in a haystack of fifteen. Removing say from the hash technique removes senseless IO operations that the other two techniques aren't testing. The rewrite of the hash technique internalizes the loop used to build the hash, and takes advantage of the fact that an exists check is cheaper than a value check. The increase of the data set size shows what the trend is as you try to upscale the different solutions. It's unlikely that the OP has a data set so small as only five elements in the haystack, and two in the needles set. In fact, it's unlikely that it's as small as 15 and 6. But it doesn't take much upscaling for the results of your benchmark to flip dramatically. Dave	[reply] [d/l] [select]
Re^3: Computing Subsets by Kenosis (Priest) on May 28, 2012 at 06:41 UTC
Thank you, Dave, for seeing the (inadvertently placed and) disruptive `say` in the sub wrapper. I didn't mean for it to be in there, and it certainly would throw off the results. It's been a long day here... Yes, the results do change with a larger data set and a better hash algorithm: `Rate isSubset isSubsetSmarter isSubsetHash isSubset 59189/s -- -48% -54% isSubsetSmarter 112912/s 91% -- -13% isSubsetHash 129049/s 118% 14% --` [download] Interestingly, if I change `do { $count-- when @big } for @small;` to `do { $count-- if $_ ~~ @big } for @small;` in `isSubsetSmarter`, the following results are produced: `Rate isSubset isSubsetSmarter isSubsetHash isSubset 59390/s -- -50% -54% isSubsetSmarter 118219/s 99% -- -9% isSubsetHash 129268/s 118% 9% --` [download] It becomes apparent now, however, that using a hash for larger data sets in these cases improves performance.	[reply] [d/l] [select]
Re^4: Computing Subsets by davido (Cardinal) on May 28, 2012 at 06:56 UTC
...and the hash technique can be made slightly better: `sub isSubsetHash2 { my @small = split ':', $_[0]; my %big; @big{ split( ':', $_[1] ) } = (); return scalar( @small ) == grep { exists $big{$_} } @small; }` [download] This eliminates the counter variable, and takes advantage of the fact that in scalar (boolean) context grep doesn't bother generating a list, only a count. These are all micro-optimizations though. The most important optimization is to choose the proper algorithm. And in the case of finding set intersections, a hash-based algorithm is nearly always going to be the right one. The hash technique will keep on getting better and better, too, as compared to the other algorithms. As the size of `@small` increases, you would be doing more and more linear searches over `@big` with the smart match approaches. Yet for the hash approach, once the big hash is built, each element in `@small` can be checked in constant time. Try 1000 elements in @big and 400 in @small. Dave	[reply] [d/l] [select]