Ultra compact set searching

by jimt (Chaplain)
on Nov 17, 2006 at 16:11 UTC

I opted to post this as a meditation rather than a SoPW since I'm just relaying back what I've done. But if somebody wants to optimize, that'd rock.

I continued to tinker an awful lot with Removing redundant powersets with minimal RAM. If you're gonna read this node, go read that node (and the one it references) for background info. My original approach involved storing a sorted version of all of the sets, from biggest set to smallest set. That required 65 million lookup checks and ran in about 2.5 minutes. If I omitted the sort, it dropped to only 45 million checks and ran in about 1.5 minutes. Go figure. I think I got lucky with the data.

But my main question was: I have several sets of sets; rapidly tell me if an arbitrary set is a subset of any of those sets. Alternatively, it can be stated as: given a set of sets, and an arbitrary additional set, rapidly tell me if that arbitrary set is a subset of any set in my set of sets.

I have made progress towards a solution, but incredibly ended up optimizing even further for RAM, not speed. Whoops. I'm posting the progress here in case the techniques prove useful for anyone else. There's a potentially useful code generation technique about three-quarters of the way down. I can be a bit verbose.

Okay, now let's jump right into it. Let's say this is my data:

Set of sets: ABCD ABCE AEF DFG ABFG

    arbitrary set: BG

Internally, this is all stored as integers, which I'll write out here in binary:

    ABCDEFG
    1111000
    1110100
    1000110
    0001011
    1100011

    arbitrary set: 0100001

I'm going to keep using the letters in the rest of the post, though; they're easier to read.
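(A minimal sketch of what that encoding might look like in Perl; the set_to_int name and the A..G alphabet setup are my own illustration, not code from this project:)

    my @alphabet = ('A' .. 'G');
    my %bit_for  = map { $alphabet[$_] => 1 << ($#alphabet - $_) } 0 .. $#alphabet;

    # Turn a set like "ABCD" into the integer whose binary form is 1111000,
    # with A as the leftmost (highest) bit.
    sub set_to_int {
        my ($set) = @_;
        my $int = 0;
        $int |= $bit_for{$_} for split //, $set;
        return $int;
    }

    printf "%s = %07b\n", $_, set_to_int($_) for qw(ABCD ABCE AEF DFG ABFG BG);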

I want to quickly determine if BG is a subset of any of those sets I have. It turns out it is: it's a subset of ABFG. But I need to run linearly through all of the sets until I get to it, and it happens to be the one at the end. Not bad in this case, with only 5 sets, but if you had, say, 5,000 involved, it gets pricey. Especially if you need to loop through multiple times.
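(For reference, that linear scan is just a subset test per stored set; in bitmask form, $want is a subset of $s exactly when ($want & $s) == $want. A minimal sketch, with is_subset_of_any as my own name for it:)

    my @sets = (0b1111000, 0b1110100, 0b1000110, 0b0001011, 0b1100011);  # ABCD ABCE AEF DFG ABFG
    my $bg   = 0b0100001;                                                # BG

    sub is_subset_of_any {
        my ($want, @candidates) = @_;
        for my $s (@candidates) {
            return 1 if ($want & $s) == $want;   # every bit of $want is in $s
        }
        return 0;
    }

    print is_subset_of_any($bg, @sets), "\n";    # prints 1 (BG is inside ABFG)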

I was tempted to try sorting. Since they're all internally integers, this is fast and easy. Except for the fact that it's flawed. You don't want to sort on the size of the integer; you want to sort on the number of elements in a set. So 01111 actually comes before 10000, since it has more elements. The goal was to put larger sets first, since they'd be more likely to contain a hit.
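(That sort might look something like this; popcount is a helper name I've made up, and counting one-bits via a binary string is just one easy way to do it:)

    my @sets = (0b1111000, 0b1110100, 0b1000110, 0b0001011, 0b1100011);

    sub popcount {
        my $bits = sprintf '%b', shift;
        return $bits =~ tr/1//;      # count of one-bits = number of elements
    }

    # biggest sets first
    my @sorted = sort { popcount($b) <=> popcount($a) } @sets;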

Unfortunately, I didn't get good results with it. Sure, some of the time it helps out, but since the sets are arbitrary, you'll also end up sorting small sets to the end of the list. If those small sets contain the elements you need, you may have just increased your amount of work. Someone may have a more clever technique.

I went with a different approach, trying to get constant time. Or, at a minimum, better performance.

The tricks I employed reduce the search space. In my previous post, I explained how I'd implemented a multi-level array to store the sets based upon what elements they contained, so you ended up with slots:

    A: (ABCD, ABCE, AEF, ABFG)
    B: (ABCD, ABCE, ABFG)
    C: (ABCD, ABCE)
    D: (ABCD, DFG)
    E: (ABCE, AEF)
    F: (AEF, DFG, ABFG)
    G: (DFG, ABFG)

Then, with our arbitrary set, BG, we just look at the sets in the B slot and the G slot.

    B: (ABCD, ABCE, ABFG)
    G: (DFG, ABFG)

If there's a superset, it's gotta be in one of those two slots. The next trick is that we only look at the smaller of the two lists, to try and minimize our work as much as possible:

    G: (DFG, ABFG)

So now we only need to check and see if BG is a subset of DFG or ABFG. That's 2 checks instead of 5. Progress.
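(A minimal sketch of that slot structure, keeping the sets as letter strings for readability; %slot and find_superset are names I've made up:)

    my %slot;    # element letter => list of sets containing it
    for my $set (qw(ABCD ABCE AEF DFG ABFG)) {
        push @{ $slot{$_} }, $set for split //, $set;
    }

    sub find_superset {
        my ($want) = @_;                   # e.g. "BG"
        my @elems = split //, $want;

        # a superset must appear in every element's slot, so scan the shortest
        my ($pick) = sort { scalar @{ $slot{$a} || [] } <=> scalar @{ $slot{$b} || [] } } @elems;
        for my $cand (@{ $slot{$pick} || [] }) {
            my %has = map { $_ => 1 } split //, $cand;
            return $cand unless grep { !$has{$_} } @elems;
        }
        return;
    }

    print find_superset('BG'), "\n";    # prints ABFG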

But I'm never happy. I wanted faster. I wanted a way to look at the sets and figure out if anything matched. I'll segue to binary again.

    ABCDEFG
    1111000
    1110100
    1000110
    0001011
    1100011

    arbitrary set: 0100001

Now, I'll transpose the matrix, since I think it makes it easier to explain.

      SSSSS
      eeeee
      ttttt
      12345
     +-----
    A|11101
    B|11001
    C|11000
    D|10010
    E|01100
    F|00111
    G|00011

So instead of listing the sets horizontally, I list them vertically. Each horizontal row is an element. Internally, each of those horizontal numbers is stored as a new integer. So instead of storing "1111000" for the set "ABCD", I store "11101" for the set of all sets that contain "A".
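(Building that transposed structure might look like this; again a sketch with my own names. Bit i of $mask{E} says whether set i, counting from the left, contains element E:)

    my @sets = qw(ABCD ABCE AEF DFG ABFG);
    my %mask = map { $_ => 0 } 'A' .. 'G';   # element letter => bitmask over the 5 sets

    for my $i (0 .. $#sets) {
        my $bit = 1 << ($#sets - $i);        # leftmost set = highest bit
        $mask{$_} |= $bit for split //, $sets[$i];
    }

    printf "%s|%05b\n", $_, $mask{$_} for 'A' .. 'G';   # reproduces the table above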

This was the eureka moment. Each horizontal row contains the list of sets that its element is in. So if you take any two elements and bitwise & their rows together, you can see if there's a set that contains both elements. Again, I'll use the same trick as before to whittle down my search.

      SSSSS
      eeeee
      ttttt
      12345
     +-----
    A|11101
    B|11001
    C|11000
    D|10010
    E|01100
    F|00111
    G|00011

Arbitrary set is BG. So take the B row and the G row.

      SSSSS
      eeeee
      ttttt
      12345
     +-----
    B|11001
    G|00011

Now you & those two integers together. 11001 & 00011 = 00001. This is a true value (not the null character). Therefore, we know that BG is a subset of some previous set. It doesn't matter which one, and for this problem, we don't care. Existence is enough.

You, the human, can read it and see that the value is 00001, which means it's the 5th set that is a superset: ABFG. You've just dropped down to needing a single check to determine if your arbitrary set is a subset of the 5 previous ones. Naturally, this is bounded by the size of your ints (normally 32 bits; 64 if you're lucky!). With more sets than that, I build an array of ints and check each one in turn.
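(A sketch of that chunked check, assuming each element's row has been split into an array of ints; superset_exists is my own name. It ANDs the corresponding chunk of every queried row and succeeds on the first nonzero word. This is the plain looping version; the generated-function version comes next:)

    sub superset_exists {
        my @rows = @_;                    # one array ref of ints per queried element
        for my $c (0 .. $#{ $rows[0] }) {
            my $acc = $rows[0][$c];
            $acc &= $_->[$c] for @rows[1 .. $#rows];
            return 1 if $acc;             # some set in this chunk holds every element
        }
        return 0;
    }

    # With the 5-set example (a single chunk), querying BG:
    print superset_exists([0b11001], [0b00011]), "\n";   # prints 1 (thanks to ABFG)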

Now, that anding was a bitch. Conceptually, it's easy to see that I need to & together two ints in this case. But the next time I may need to & together three ints. And the next time, 15 ints. And so on. I didn't want to loop over the structure and & each value together in turn, so instead I wrote up this little gizmo to generate a function that &s together an arbitrary number of elements.

    my %dispatch = ();

    sub and_maker {
        my $num_args = shift;
        return $dispatch{$num_args} if $dispatch{$num_args};
        my $built = join '&', map { '$_[' . $_ . ']' } (0 .. $num_args - 1);
        return $dispatch{$num_args} = eval "sub { $built }";
    }

Call as:

    my $func = and_maker(3);
    my $results1 = $func->(1,2,3);    # 3 args
    my $results2 = $func->(5,6,7);    # 3 args

    my $func2 = and_maker(5);
    my $results3 = $func2->(12,13,14,15,16);    # 5 args

Performance is a bit twitchy, though. In benchmarking, iteration over the elements was faster with very small sets, less than 10 elements. Then this approach became faster than iteration for sets up to about 4,000 elements. Then iteration became faster again after that. I assume there's some sort of internal limit that gets hit, but dunno for sure.

And the results of this long lecture? Bupkiss. Well, not what I wanted, at least. Despite all my efforts to reduce the number of checks, it didn't help.

Now, the total number of checks was greatly diminished: only 14 million vs. 45 million in the previous version (the version I'd originally written back in the original thread only requires 2.5 million). But each check was more complicated (a big & statement across multiple values vs. &ing two values).

And the cost per check went up. My previous worst case was 833 checks, and now it's about 120, but each of those 120 checks has to touch a lot more elements than the 833 did in the prior version. So the runtime is the same, a little over 2 minutes.

What I accidentally did was dramatically reduce the RAM. For the data set that Limbic~Region provided, my last version required storing some 32,000 integers (note my last post has a typo, where I referred to it as 32,000 bytes). This version? It only needs to store 3,200 integers. That's 1/10th the RAM. So I can do the full-on huge example of 10,000,000 sets with 24 elements each in only 100 megs of RAM (again, my original post was in error: it's not 240 megs, it's 960). That assumes minimal int size and no other perl funniness, which I know is inaccurate.

I can clean up and post the code if anyone's interested in pursuing this further. I'm still going to monkey with it. I feel like it's very close to a breakthrough, but haven't figured it out yet.

Replies are listed 'Best First'.
Re: Ultra compact set searching
by kyle (Abbot) on Nov 20, 2006 at 19:50 UTC
    I don't know how you're doing the iterated version of and_maker, but if it's not List::Util::reduce, you might try that.
    use List::Util qw( reduce );
    my $result = reduce { $a & $b } @int_lists_great_and_small;
Re: Ultra compact set searching
by radiantmatrix (Parson) on Nov 22, 2006 at 17:57 UTC

    I can't see your and_maker, but based on its interface, I'm guessing you're actually building a sub. I suppose that works, but instead, what about:

    sub multi_and {
        my $result = shift;
        while (@_) { $result &= shift; }
        $result;
    }

    That might be used like:

    my $results1 = multi_and(1,2,3);           # 3 args
    my $results2 = multi_and(5,6,7);           # 3 args
    my $results3 = multi_and(12,13,14,15,16);  # 5 args

    No code generation, should be pretty fast. If the repeated shift is a problem, then this should work, too:

    sub multi_and { my $result = shift; for (@_) { $result &= $_ } $result }

      I'll give you one very good reason to build the sub - performance.

      The whole point of building the sub is not to do any iterating. You build the sub with a set number of arguments, then you don't need to loop. As I'd said, performance is twitchy, but for my test cases, it blew the iterative multi_and out of the water.

      #!/usr/bin/perl
      use Benchmark;

      sub multi_and { my $result = shift; for (@_) { $result &= $_ } $result }

      my %dispatch = ();

      sub and_maker {
          my $num_args = shift;
          return $dispatch{$num_args} if $dispatch{$num_args};
          my $built = join '&', map { '$_[' . $_ . ']' } (0 .. $num_args - 1);
          return $dispatch{$num_args} = eval "sub { $built }";
      }

      # I'll go ahead and pre-cache these
      my $and_maker_5     = and_maker(5);
      my $and_maker_50    = and_maker(50);
      my $and_maker_500   = and_maker(500);
      my $and_maker_5000  = and_maker(5000);
      my $and_maker_50000 = and_maker(50000);

      my @list_5     = (1..5);
      my @list_50    = (1..50);
      my @list_500   = (1..500);
      my @list_5000  = (1..5000);
      my @list_50000 = (1..50000);

      timethese(1000000, {
          'and_maker 5 elements' => sub { $and_maker_5->(@list_5); },
          'multi_and 5 elements' => sub { multi_and(@list_5); },
      });
      print "----\n";
      timethese(100000, {
          'and_maker 50 elements' => sub { $and_maker_50->(@list_50); },
          'multi_and 50 elements' => sub { multi_and(@list_50); },
      });
      print "----\n";
      timethese(10000, {
          'and_maker 500 elements' => sub { $and_maker_500->(@list_500); },
          'multi_and 500 elements' => sub { multi_and(@list_500); },
      });
      print "----\n";
      timethese(1000, {
          'and_maker 5000 elements' => sub { $and_maker_5000->(@list_5000); },
          'multi_and 5000 elements' => sub { multi_and(@list_5000); },
      });
      print "----\n";
      timethese(100, {
          'and_maker 50000 elements' => sub { $and_maker_50000->(@list_50000); },
          'multi_and 50000 elements' => sub { multi_and(@list_50000); },
      });

      And the results:

      Benchmark: timing 1000000 iterations of and_maker 5 elements, multi_and 5 elements...
      and_maker 5 elements:  0 wallclock secs ( 0.95 usr + 0.00 sys = 0.95 CPU) @ 1052631.58/s (n=1000000)
      multi_and 5 elements:  2 wallclock secs ( 2.33 usr + 0.01 sys = 2.34 CPU) @ 427350.43/s (n=1000000)
      ----
      Benchmark: timing 100000 iterations of and_maker 50 elements, multi_and 50 elements...
      and_maker 50 elements: 0 wallclock secs ( 0.47 usr + 0.00 sys = 0.47 CPU) @ 212765.96/s (n=100000)
      multi_and 50 elements: 1 wallclock secs ( 1.06 usr + 0.01 sys = 1.07 CPU) @ 93457.94/s (n=100000)
      ----
      Benchmark: timing 10000 iterations of and_maker 500 elements, multi_and 500 elements...
      and_maker 500 elements: 1 wallclock secs ( 0.68 usr + 0.00 sys = 0.68 CPU) @ 14705.88/s (n=10000)
      multi_and 500 elements: 1 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 10638.30/s (n=10000)
      ----
      Benchmark: timing 1000 iterations of and_maker 5000 elements, multi_and 5000 elements...
      and_maker 5000 elements: 1 wallclock secs ( 0.87 usr + 0.00 sys = 0.87 CPU) @ 1149.43/s (n=1000)
      multi_and 5000 elements: 1 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 1063.83/s (n=1000)
      ----
      Benchmark: timing 100 iterations of and_maker 50000 elements, multi_and 50000 elements...
      and_maker 50000 elements: 1 wallclock secs ( 0.97 usr + 0.01 sys = 0.98 CPU) @ 102.04/s (n=100)
      multi_and 50000 elements: 1 wallclock secs ( 0.96 usr + 0.00 sys = 0.96 CPU) @ 104.17/s (n=100)

      The iterative multi_and didn't overtake it until I reached a list with 50,000 elements. Performance may vary with the elements in the list, though, I suppose. You may be able to improve performance by having it bow out early if it &s down to 0, but the added overhead of the conditional may negate the improvement on average.
