Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets

by kikuchiyo (Hermit)
on Dec 12, 2018 at 21:58 UTC ( [id://1227173]=perlmeditation: print w/replies, xml ) Need Help??

If we have two sets, it is considered a solved problem to get their union or intersection or symmetric difference, there is even an entry in perlfaq4 about it. The situation is slightly more complicated if we have more than two input sets, because then it's a valid question to ask to e.g. get the set of elements that are in the first or second set but not in the third and fourth etc. The number of combinations grows rapidly with the number of input sets, and just writing ad hoc solutions to each little problem becomes infeasible. So a more general solution is needed - the hard part is designing the user interface so that it is able to express all the possible combinations of selections in a general and flexible, yet efficient and understandable manner. A cursory search of CPAN brought up several (abandoned?) modules in the Set::* namespace, but none of them was exactly what I needed.

I have the outline of an attempted solution. It's an OO module that has a constructor to which the input sets can be fed, and one method called select, which accepts a selector string argument and emits (an arrayref of) the elements that match the selector. If we have 3 input sets, then the '110' selector string selects all elements that are in the first and second sets but not in the third.

#!/usr/bin/perl { package Set::Select; use strict; use warnings; sub new { my ($class, $args, @sets) = @_; my $attr; $attr = $args->{key} if (ref $args eq 'HASH' and exists $args->{ke +y}); my $self; for my $i (0..$#sets) { for my $elem (@{$sets[$i]}) { my $key = defined $attr && ref $elem eq 'HASH' ? $elem->{$ +attr} : $elem; $self->{$key}->[1] //= $elem; $self->{$key}->[0] //= '0' x @sets; vec($self->{$key}->[0], $i, 8) = 0x31; } } bless $self, $class; } sub select { my ($self, $bits) = @_; return [map { $self->{$_}->[1] } grep { $self->{$_}->[0] =~ $bits +} keys %$self]; } } package main; use strict; use warnings; use Data::Dumper; my $x = Set::Select->new({}, [1, 3, 5, 7], [2, 3, 6, 7], [4, 5, 6, 7]) +; print Dumper $x->select($_) for qw/100 101 111 10. ... /; my $y = Set::Select->new({key => 'id' }, [{id => 1, value => 1}, {id => 3, value => 1}, {id => 5, value => +1}, {id => 7, value => 1}], [{id => 2, value => 2}, {id => 3, value => 2}, {id => 6, value => +2}, {id => 7, value => 2}], [{id => 4, value => 3}, {id => 5, value => 3}, {id => 6, value => +3}, {id => 7, value => 3}], ); print Dumper $y->select($_) for qw/100 101 111 10. ... /;

A Venn diagram that may or may not make the intent clearer:

         .---.
        /  1  \
       |       |
    .--+--. .--+--.
   /   | 3 X 5 |   \
  |    |  / \  |    |
  |  2  \/ 7 \/  4  |
  |     |`---'|     |
  |      \ 6 /      |
   \      \ /      /
    `------^------'

I think these selector strings as the primary (and only) user interface are better than the possible alternatives that come to mind: a verbose, ad hoc query language would have to be explained at length in the documention, tested carefully in the source, and parsed painfully at runtime, while a forest of arbitrarily named methods to select this or that subset would bloat the code needlessly and make it harder to use.

Using regular expressions opens the door to abuse, but it also allows convenient and terse selector strings, makes the implementation efficient, and it's something people already know.

If the elements are hashrefs (representing a record or object or something), there is a mode to use not the elements themselves but a named key inside them as the basis of selection, as the second example shows. This mode can be considered buggy as it is now, because only one version of a record with the same key is stored (in the example, some values are discarded. I don't have a good solution for this problem yet, partly because it would make the implementation slower and more complicated, partly because I don't know what would be the right thing to do.

Questions:

  • Is this useful to anyone?
  • How to make it better?
  • What would be a good name if this were to become a module? I've tentatively chosen Set::Select but it may be too generic.

Replies are listed 'Best First'.
Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by shmem (Chancellor) on Dec 16, 2018 at 14:13 UTC
    If we have 3 input sets, then the '110' selector string selects all elements that are in the first and second sets but not in the third.

    This looks like an INNER JOIN operation as known from SQL. But there's no facility to select elements that are unique to each set, i.e. select elements from each set which aren't present in the other sets (which would be an OUTER JOIN, right?).

    So, the selector string should sport the values (012) or it should not be a string rather than an arrayref holding -1,0,1 elements. That would be more perlish (e.g.sort). I haven't thought of all combinations of intersecting three or more sets, which would require more than just three values for each set, or of a language to combine possible intersections of 3 or more sets; but I think that a good name would be Set::Intersect.

    Then, thinking a bit more - if you want the string interface, then a lispish sort of query specification might do the job, as used in LDAP queries - (|(&(1=0)(!(2=-1)))(3=1)(4=1)) - perhaps translated into "and", "or", "in", "not in" and such to make it more perlish. Or "and", "or", "not", "nor", "xor" etc - basically, you need boolean logic and precedence to process things, and a descriptive input specification which supports that.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      But there's no facility to select elements that are unique to each set, i.e. select elements from each set which aren't present in the other sets

      There is. '100' selects elements that are unique to the first set. '001|010|100' selects what you want.

      then a lispish sort of query specification might do the job, as used in LDAP queries - (|(&(1=0)(!(2=-1)))(3=1)(4=1)) - perhaps translated into "and", "or", "in", "not in" and such to make it more perlish. Or "and", "or", "not", "nor", "xor" etc - basically, you need boolean logic and precedence to process things, and a descriptive input specification which supports that.

      This is precisely what I didn't want to do. Lot of work, lot of edge cases, lot of tests and documentation to write.

Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by kikuchiyo (Hermit) on May 24, 2023 at 16:07 UTC

    Released: Set::Select

    The interface of the published module is almost the same as the one outlined above, except I've removed the hash key mode, because it wasn't well defined. I've also added an all_subsets method that dumps all distinct non-empty subsets as a hash(ref).

    (And I've messed up the POD, because of course I did)

Re: RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input sets
by Laurent_R (Canon) on Dec 14, 2018 at 20:59 UTC
    Sounds quite good. Yes, I think this would be useful. I don't really have good answers to the other two questions.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://1227173]
Approved by haukex
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (8)
As of 2024-03-28 15:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found