Finding an intersection of two sets, lean and mean

This will find the intersection of two sets. That, by it self, is nothing new... my problem however is space vs. time. I don't want the trade-off, I want it to be fast and small whereas the CPAN modules providing the means I want are always small && slow OR huge && fast I use this to apply blacklists to address-files. The below snippet is a fully functional program. Play around with $keyLength to see increase in performance (or not)

open(local *BLACKLIST, "<blacklist");
open(local *ADDRESS, "<address");

@blacklist = <BLACKLIST>;
@address = <ADDRESS>;

my $sep = "|";
my $keyLength = 6;

my @blackListed = intersection(\@blacklist, \@address);

print "Found: " . scalar(@blackListed) . "\n";

sub intersection {
  my ( $list1, $list2 ) = @_;

  my %strings;
  my $loop;

  # turn the biggest list into a strings hash.
  # AND loop through the smallest list.
  if ( $#{$list1} > $#{list2} ) {
    %strings = makeStringsHash( \$list1 );
    $loop = \$list2;
  } else {
    %strings = makeStringsHash( \$list2 );
    $loop = \$list1;
  }

  # run through the smallest of lists
  my @intersection = ();

  # for each key remember the last position ( the strings in the hash
  # are sorted, remember? )
  my %lastPos = ();
  foreach my $entry ( @{ $$loop } ) {

    my $key = substr($entry, 0, $keyLength);

    my $pos = $lastPos{$key} || 0;
    my $tmp = index( $strings{$key}, $sep.$entry.$sep, $pos );

    # if we found it in the big-list, add it to the intersection
    if ( $tmp != -1 ) {
      push @intersection, $entry;
      $lastPos{$key} = $tmp;
    }
  }

  return @intersection
}

sub makeStringsHash {
  my ( $list ) = @_;

  my %strings = ();
  $strings{substr($_, 0, $keyLength)} .= $sep . $_ . $sep foreach ( so
+rt @$$list );

  return %strings;
}
[download]

Comment on Finding an intersection of two sets, lean and mean Download Code

Replies are listed 'Best First'.
Re: Finding an intersection of two sets, lean and mean by dragonchild (Archbishop) on Jun 29, 2005 at 14:20 UTC
Uhh ... you do realize that you just implemented a very rudimentary hash table in Perl using ... wait for it ... hashes! Except, you don't handle collisions. What's wrong with: `sub intersection { my ( $list1, $list2 ) = @_; my %hash; @hash{@$list1} = undef; my @intersection = grep { exists $hash{$_} } @$list2; return @intersection }` [download] Update: Fixed mistake in my code (map -> grep) Compare mine to yours and I think you'll notice a difference. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply] [d/l]
Re^2: Finding an intersection of two sets, lean and mean by Sinister (Friar) on Jun 29, 2005 at 14:31 UTC
Allas, this doesn't work... I tried this in my test-set and the result is that @intersection = @$list2, which is _not_ true. I can't seem to figure out why this happens yet. Your example should be perfectly legal. I know. It is only much slower	[reply]
Re^3: Finding an intersection of two sets, lean and mean by Sinister (Friar) on Jun 29, 2005 at 14:52 UTC
Found it! Perhaps you meant 'grep' instead of 'map' ?? ;-> This gives a different result set then mine (larger in every case), but this could be a problem with my code... It is faster, untill the sets become a teeny weeny bit bigger (eg: +5M lines)	[reply]
Re^4: Finding an intersection of two sets, lean and mean by dragonchild (Archbishop) on Jun 29, 2005 at 15:00 UTC
Re^3: Finding an intersection of two sets, lean and mean by dragonchild (Archbishop) on Jun 29, 2005 at 14:36 UTC
if you;re going to tell me that my code is wrong, show me why. That way, I can either fix my code or demonstrate how you misused it (or both). Don't just say "Uh, your code suxx0rs" and walk away. That's not how OSS works. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^4: Finding an intersection of two sets, lean and mean by Sinister (Friar) on Jun 29, 2005 at 14:42 UTC
Re^5: Finding an intersection of two sets, lean and mean by dragonchild (Archbishop) on Jun 29, 2005 at 14:57 UTC
Re: Finding an intersection of two sets, lean and mean by radiantmatrix (Parson) on Jun 29, 2005 at 16:26 UTC
Why are you reinventing a well-solved problem? If your white- and black-list data is growing to the point where your standard "small and slow" isn't working, maybe it's time for a database. A `SELECT .. JOIN` construct lets the DB software find the intersections, and it can do so remarkably fast. Besides, what exactly is wrong with "huge and fast"? Disk space isn't generally a huge concern when you're talking about the sizes of CPAN modules... Yoda would agree with Perl design: there is no `try{}`	[reply] [d/l]
Re^2: Finding an intersection of two sets, lean and mean by Sinister (Friar) on Jun 30, 2005 at 07:51 UTC
I have taken this in to serious consideration, however the impact of this does not justify it. The problem with huge and fast is not disk space, it's memory space. The software that performs the mailings all runs daemonized, since start-up is our biggest penalty, having 5 500M(++) daemons laying around is not funny. I need to preserve RAM space (eg: pref. no more then the size of the file) and be fast with my blacklisting, since operators are getting more and more demanding. Dragonchild's solution was my first attempt as well, it consumes less memory but it is slow(er) so I came up with the above piece of code... TIMTOWTDI, remember?	[reply]
Re^3: Finding an intersection of two sets, lean and mean by radiantmatrix (Parson) on Jun 30, 2005 at 16:26 UTC
The problem with huge and fast is not disk space, it's memory space. The software that performs the mailings all runs daemonized, since start-up is our biggest penalty, having 5 500M(++) daemons laying around is not funny. Ah, I misunderstood what you meant by 'huge'. Still, if memory is your concern, that sounds like an even better reason to use a DB and let the DB handle the intersection calulations. BTW, what solution for intersection handling results in a 500MB memory footprint?! I'd like to know so I can avoid that myself. For the purpose of blacklisting, it might be small-and-fast to convert your list of addresses into a hash instead. Assuming you've already populated `@blacklist` and `@address`, your intersection sub might look like: `my @BlackListed = intersect_of(\@blacklist, \@address); sub intersect_of ($$) { my $a, $b = @_; my (%set_a, %set_b); ## put the larger set in %set_a if (@$a > @$b) { %set_a = map { $_ => undef } @$a; %set_b = map { $_ => undef } @$b; else { %set_a = map { $_ => undef } @$b; %set_b = map { $_ => undef } @$a; } my @intersect; ## iterate through smaller set for (keys %set_b) { push @intersect, $_ if exists $set_a{$_} } return @intersect; }` [download] This exact code is untested, but I have used code like it for whitelist/blacklist processing with address list files of about 5M each, and it performed quite acceptably. YMMV, of course. Yoda would agree with Perl design: there is no `try{}`	[reply] [d/l]