
The problem with huge and fast is not disk space, it's memory space. The software that performs the mailings all runs daemonized, since start-up is our biggest penalty; having 5 500M(++) daemons lying around is not funny.

Ah, I misunderstood what you meant by 'huge'. Still, if memory is your concern, that sounds like an even better reason to use a DB and let the DB handle the intersection calculations. BTW, what solution for intersection handling results in a 500MB memory footprint?! I'd like to know so I can avoid that myself.

For the purpose of blacklisting, it might be small-and-fast to convert your list of addresses into a hash instead. Assuming you've already populated @blacklist and @address, your intersection sub might look like:

my @BlackListed = intersect_of(\@blacklist, \@address);

sub intersect_of {
   ## two array refs; named $x/$y because $a and $b belong to sort()
   my ($x, $y) = @_;
   my (%set_a, %set_b);

   ## put the larger set in %set_a
   if (@$x > @$y) {
      %set_a = map { $_ => undef } @$x;
      %set_b = map { $_ => undef } @$y;
   }
   else {
      %set_a = map { $_ => undef } @$y;
      %set_b = map { $_ => undef } @$x;
   }

   my @intersect;

   ## iterate through smaller set
   for (keys %set_b) {
      push @intersect, $_ if exists $set_a{$_};
   }

   return @intersect;
}

This exact code is untested, but I have used code like it for whitelist/blacklist processing with address list files of about 5M each, and it performed quite acceptably. YMMV, of course.
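
Since memory is the stated worry, here is a leaner variant to consider. It is only a sketch: intersect_lean and the sample addresses are names I made up for illustration, and it assumes both lists are already in memory as arrays. It hashes only the smaller list and filters the larger one in a single pass, so at most one hash is ever built.

```perl
use strict;
use warnings;

## Hypothetical helper (not from the code above): hashes only the
## smaller list, then filters the larger one in a single pass.
sub intersect_lean {
   my ($x, $y) = @_;                      # two array refs
   ($x, $y) = ($y, $x) if @$x > @$y;      # make $x the smaller list

   my %seen = map { $_ => 1 } @$x;        # one hash, smaller list only

   ## delete() returns 1 the first time a key matches and undef after
   ## that, so duplicates in the larger list are reported only once
   return grep { delete $seen{$_} } @$y;
}

## made-up sample data
my @blacklist = qw(spam@example.com junk@example.net);
my @address   = qw(a@example.org spam@example.com b@example.org spam@example.com);

my @hits = intersect_lean(\@blacklist, \@address);
print "$_\n" for @hits;
```

The trade-off is that results come back in the order of the larger list rather than in hash order, and the temporary hash shrinks as matches are found. Whether this beats the two-hash version for your data is something you'd have to measure.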

Yoda would agree with Perl design: there is no try{}

In reply to Re^3: Finding an intersection of two sets, lean and mean by radiantmatrix
in thread Finding an intersection of two sets, lean and mean by Sinister
