If we have two sets, it is considered a solved problem to get their union or intersection or symmetric difference, there is even an entry in perlfaq4 about it. The situation is slightly more complicated if we have more than two input sets, because then it's a valid question to ask to e.g. get the set of elements that are in the first or second set but not in the third and fourth etc. The number of combinations grows rapidly with the number of input sets, and just writing ad hoc solutions to each little problem becomes infeasible. So a more general solution is needed - the hard part is designing the user interface so that it is able to express all the possible combinations of selections in a general and flexible, yet efficient and understandable manner. A cursory search of CPAN brought up several (abandoned?) modules in the Set::* namespace, but none of them was exactly what I needed.
I have the outline of an attempted solution. It's an OO module that has a constructor to which the input sets can be fed, and one method called select, which accepts a selector string argument and emits (an arrayref of) the elements that match the selector. If we have 3 input sets, then the '110' selector string selects all elements that are in the first and second sets but not in the third.
#!/usr/bin/perl { package Set::Select; use strict; use warnings; sub new { my ($class, $args, @sets) = @_; my $attr; $attr = $args->{key} if (ref $args eq 'HASH' and exists $args->{ke +y}); my $self; for my $i (0..$#sets) { for my $elem (@{$sets[$i]}) { my $key = defined $attr && ref $elem eq 'HASH' ? $elem->{$ +attr} : $elem; $self->{$key}->[1] //= $elem; $self->{$key}->[0] //= '0' x @sets; vec($self->{$key}->[0], $i, 8) = 0x31; } } bless $self, $class; } sub select { my ($self, $bits) = @_; return [map { $self->{$_}->[1] } grep { $self->{$_}->[0] =~ $bits +} keys %$self]; } } package main; use strict; use warnings; use Data::Dumper; my $x = Set::Select->new({}, [1, 3, 5, 7], [2, 3, 6, 7], [4, 5, 6, 7]) +; print Dumper $x->select($_) for qw/100 101 111 10. ... /; my $y = Set::Select->new({key => 'id' }, [{id => 1, value => 1}, {id => 3, value => 1}, {id => 5, value => +1}, {id => 7, value => 1}], [{id => 2, value => 2}, {id => 3, value => 2}, {id => 6, value => +2}, {id => 7, value => 2}], [{id => 4, value => 3}, {id => 5, value => 3}, {id => 6, value => +3}, {id => 7, value => 3}], ); print Dumper $y->select($_) for qw/100 101 111 10. ... /;
A Venn diagram that may or may not make the intent clearer:
.---.
/ 1 \
| |
.--+--. .--+--.
/ | 3 X 5 | \
| | / \ | |
| 2 \/ 7 \/ 4 |
| |`---'| |
| \ 6 / |
\ \ / /
`------^------'
I think these selector strings as the primary (and only) user interface are better than the possible alternatives that come to mind: a verbose, ad hoc query language would have to be explained at length in the documention, tested carefully in the source, and parsed painfully at runtime, while a forest of arbitrarily named methods to select this or that subset would bloat the code needlessly and make it harder to use.
Using regular expressions opens the door to abuse, but it also allows convenient and terse selector strings, makes the implementation efficient, and it's something people already know.
If the elements are hashrefs (representing a record or object or something), there is a mode to use not the elements themselves but a named key inside them as the basis of selection, as the second example shows. This mode can be considered buggy as it is now, because only one version of a record with the same key is stored (in the example, some values are discarded. I don't have a good solution for this problem yet, partly because it would make the implementation slower and more complicated, partly because I don't know what would be the right thing to do.
Questions:
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by shmem (Chancellor) on Dec 16, 2018 at 14:13 UTC | |
by kikuchiyo (Hermit) on Dec 16, 2018 at 15:47 UTC | |
|
Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by kikuchiyo (Hermit) on May 24, 2023 at 16:07 UTC | |
|
Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by Laurent_R (Canon) on Dec 14, 2018 at 20:59 UTC |