in reply to better union of sets algorithm?

If your values are integers in a sane (i.e. reasonably low) range, you can save some time, up to around 60% in the runs below, by using a bit vector instead of a hash:

#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];

our $B ||= 32;        # bits per vec() element (-B=)
our $N ||= 1000;      # values per set (-N=)
our $S ||= 100;       # number of sets (-S=)
our $R ||= $N * $S;   # range of the random values (-R=)

our @sets = map{ [ map int rand $R, 1 .. $N ] } 1 .. $S;
our( @hUniq, @vUniq );

cmpthese -2, {
    hash => q[
        my %seen;
        undef @hUniq;
        @hUniq = grep{ $seen{ $_ }++ } @$_ for @sets;
    ],
    vec => q[
        my $vector = '';
        undef @vUniq;
        @vUniq = grep{ vec( $vector, $_, $B )++ } @$_ for @sets;
    ],
};

print 'H:'. @hUniq, ' V: '. @vUniq;

__END__
P:\test>438536
       Rate hash  vec
hash 2.69/s   -- -23%
vec  3.47/s  29%   --
H:954 V: 954

P:\test>438536 -S=20
       Rate hash  vec
hash 16.4/s   -- -41%
vec  27.5/s  68%   --
H:639 V: 639

P:\test>438536 -S=20 -N=100
      Rate hash  vec
hash 264/s   -- -21%
vec  336/s  27%   --
H:61 V: 61

P:\test>438536 -S=20 -N=2000
       Rate hash  vec
hash 7.33/s   -- -36%
vec  11.5/s  56%   --
H:1407 V: 1407

P:\test>438536 -S=20 -N=2000 -B=16
       Rate hash  vec
hash 7.06/s   -- -40%
vec  11.8/s  67%   --
H:1399 V: 1399

If you could build and manipulate your sets as bit vectors, by storing your unique values in a hash (and an array) as you get them and setting bits in the vectors to represent them, then ORing the bit vectors will give you the union. You then use the positions of the set bits as indexes into your array of unique values to reconstitute the union set.

It works for any type of value and is very fast, provided you can build and work with your sets that way.
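
A minimal sketch of that idea, assuming string values. The names %index_of, @value_at, set_to_vector and vector_to_set are made up for this illustration; they are not from any module.

```perl
use strict;
use warnings;

# Hypothetical lookup tables shared by all sets:
# %index_of maps a value to its bit position,
# @value_at maps a bit position back to the value.
my( %index_of, @value_at );

# Build a bit vector for one set, handing out new bit positions as needed.
sub set_to_vector {
    my @members = @_;
    my $vector  = '';
    for my $value ( @members ) {
        unless( exists $index_of{ $value } ) {
            $index_of{ $value } = @value_at;    # next free position
            push @value_at, $value;
        }
        vec( $vector, $index_of{ $value }, 1 ) = 1;
    }
    return $vector;
}

# Turn the set bits of a vector back into the values they stand for.
sub vector_to_set {
    my( $vector ) = @_;
    return map  { $value_at[ $_ ] }
           grep { vec( $vector, $_, 1 ) } 0 .. $#value_at;
}

my $first  = set_to_vector( qw[ apple pear plum ] );
my $second = set_to_vector( qw[ plum fig apple ] );

# String-wise OR of the two vectors is the union.
my @union = vector_to_set( $first | $second );
print "@union\n";    # apple pear plum fig
```

The union comes back in the order the values were first seen, because that is the order in which bit positions were handed out.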


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.

Re^2: better union of sets algorithm?
by Anonymous Monk on Mar 11, 2005 at 10:53 UTC
    Only if they are non-negative integers....

    (Well, theoretically, this method will work for any set for which you have a fast, 1-to-1 mapping to the set of non-negative integers. For instance, if all your members are negative integers, multiplying each member by -1 gives you positive integers to use as positions in the bit string.)

      Only if they are non-negative integers...
      The non-negative integers can be mapped 1-to-1 onto the entire set of integers (both sets have the same cardinality). You can construct a mapping so that the bit index is unique for any integer. For example:
      my $index = ( $integer >= 0 ) ? 2 * $integer : -2 * $integer - 1;
      Then $index is non-negative and even for non-negative integers, and positive and odd for negative integers.
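
A small sketch of that mapping used with vec(), including the inverse so the members can be read back out. The sub names int_to_index and index_to_int are invented for the example.

```perl
use strict;
use warnings;

# Map any integer to a unique non-negative bit index:
# 0, 1, 2, ... go to 0, 2, 4, ... and -1, -2, -3, ... go to 1, 3, 5, ...
sub int_to_index {
    my( $integer ) = @_;
    return $integer >= 0 ? 2 * $integer : -2 * $integer - 1;
}

# Inverse mapping: even indexes are non-negative, odd ones are negative.
sub index_to_int {
    my( $index ) = @_;
    return $index % 2 ? -( $index + 1 ) / 2 : $index / 2;
}

# Store a set containing negative values in a bit vector.
my $vector = '';
vec( $vector, int_to_index( $_ ), 1 ) = 1 for -3, 0, 2, -1;

# Read the members back by scanning the set bits.
my @members = sort { $a <=> $b }
              map  { index_to_int( $_ ) }
              grep { vec( $vector, $_, 1 ) } 0 .. 8 * length( $vector ) - 1;
print "@members\n";    # -3 -1 0 2
```

Note that the vector needs roughly twice as many bits as the largest absolute value in the set.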

      BTW, there is an interesting idea for Bottle Golf at Aleph-0 on Mathworld.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re^2: better union of sets algorithm?
by perrin (Chancellor) on Mar 11, 2005 at 18:48 UTC
    I had the same thought when reading through Mastering Algorithms with Perl, but I'm actually working with strings here. I can't think of a cheap way to map strings to bits that would work for this. It may be worth trying to keep track of integers for these since I could probably do it cheaply as they are added to my database and then the union could be done this way.

      Yes. The mapping is the crux of the issue.

      If you're doing the unions (or intersections, or symmetric differences) on a regular basis, then it can be worth the effort of building a unique index (offline) and replacing your sets of strings with bit vectors mapped against that index.

      You then hold and maintain your sets as bit vectors, and all the set manipulations become easy and efficient, except adding (and, to a lesser extent, displaying), which requires mapping.
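
To illustrate, here is a sketch of those manipulations using Perl's string-wise bit operators. It assumes the index already exists, so the sets are shown directly as bit positions; the helper names members and count are made up.

```perl
use strict;
use warnings;

# Two sets over a shared (hypothetical) index: bit N stands for item N.
my( $x, $y ) = ( '', '' );
vec( $x, $_, 1 ) = 1 for 0, 1, 2, 5;
vec( $y, $_, 1 ) = 1 for 2, 3, 5, 9;

# Each set operation is a single string-wise bit operation.
my $union        = $x | $y;
my $intersection = $x & $y;
my $sym_diff     = $x ^ $y;

# List the bit positions that are set in a vector.
sub members {
    my( $v ) = @_;
    return grep { vec( $v, $_, 1 ) } 0 .. 8 * length( $v ) - 1;
}

# Count the members without listing them (unpack checksum of the bits).
sub count { return unpack '%32b*', $_[0] }

print 'union:        ', join( ' ', members( $union ) ), "\n";
print 'intersection: ', join( ' ', members( $intersection ) ), "\n";
print 'sym. diff:    ', join( ' ', members( $sym_diff ) ), "\n";
print 'union size:   ', count( $union ), "\n";
```

Both operands must be strings for these operators to work bit-wise on the vectors; if either has been used as a number, Perl does a numeric operation instead.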

      The index doesn't need to be ordered in any way, just unique, though ordering it does allow the use of a binary chop for lookups when adding (or displaying).
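
A sketch of such a binary chop, assuming a sorted index that is fixed once built (inserting a new value into a sorted index would shift the bit positions of everything after it). The sample data and the name bit_for are invented for the example.

```perl
use strict;
use warnings;

# Sorted unique index (hypothetical sample data); array position == bit number.
my @index = sort qw[ fig apple plum pear quince ];

# Binary chop: return the bit position for $key, or nothing if it is absent.
sub bit_for {
    my( $key ) = @_;
    my( $lo, $hi ) = ( 0, $#index );
    while( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $key cmp $index[ $mid ];
        return $mid if $cmp == 0;
        if( $cmp < 0 ) { $hi = $mid - 1 }
        else           { $lo = $mid + 1 }
    }
    return;    # not in the index
}

# Adding: map each string to its bit and set it.
my $set = '';
for my $item ( qw[ plum apple ] ) {
    my $bit = bit_for( $item );
    vec( $set, $bit, 1 ) = 1 if defined $bit;
}

# Displaying: set bits index straight back into the array.
print join( ' ', map { $index[ $_ ] } grep { vec( $set, $_, 1 ) } 0 .. $#index ), "\n";
```

Displaying needs no search at all, since the bit number is the array position; only adding pays for the lookup.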

      Whether the offline work of mapping can be amortised to effect an overall saving depends on how often your sets change and how often you need to do the unions.

