Let's discuss approaches to finding intersections among a large group of sets. I'll present a thrifty approach that uses much less memory and less time than a naive approach. We'll begin with theory and progress to an executable demo. I make the assumption that intersections between any two given sets are relatively small/few.
Please load, in another window, the small figures here.
Finding intersections among a group of sets can be abstracted partially to a graph in which each set is represented by a node and each comparison of two sets by an edge. The figure on the left, Party, is the complete graph on 8 nodes, called K8. It has 28 edges; each node is connected directly to every other. The figure on the right, Book, is a single-level tree of 9 nodes and 9 edges. One of the nodes is colored to indicate its special nature; it might be called the tree's root. (In the demo that follows, essentially every node and edge has a label; but this can be ignored at a high level of abstraction.)
If you're wondering, Why gay dating?, it's because otherwise the model would be a bipartite graph. As in most simple demos, we're ignoring a lot of subtleties.
Algorithmic complexity is often described using big O notation. This is a shorthand method of comparing the performance of algorithms on large data sets; specifically, as they increase without bound. Few algorithms are able to do useful work on a data set of N elements in better than O(N) time; but if a way can be found to keep most of the data set on disk, memory usage may well be O(1): constant regardless of set size. An example is finding the sum of a set of integers. Millions of integers might be read from a disk file and each added to an in-memory variable, an accumulator. Each element is discarded after the accumulation step; so memory usage does not scale as O(N). When all elements have been accumulated, the accumulator is printed.
Very roughly, O(N) performance is generally good; O(Nc) fair for small values of c; while more expensive algorithms are best avoided unless you're sure N will stay small. Of course, if you're able to do better than O(N), as perhaps O(log N), you're doing well indeed. The more data you have, the more important it is to improve your big O.
This node compares approaches in the range ( O(N) .. O(N2) ).
The model data consists of a number of guys looking to go out with other guys on specific days of the month. Al is only available on the 1st and 2nd of the month, and so on. There might be thousands of guys to consider. How do we hook them up?
One traditional solution is to hold a party. Guys circulate around the room, meeting other guys briefly and comparing their lists of availability. In order to get the complete solution, each guy must speak to every other guy, as indicated by the graph.
In Real Life, the main obstacle to a party is just getting everyone together for one night. But the time it takes for every guy to compare his calendar with every other guy may not be excessive, since comparisions can take place concurrently.
In a Perl script, we are generally (though not always) limited to linear processing; we must make one comparison after another. For N guys, that's (N2 - N) / 2 comparisons; we consider that O(N2) time, since the quadratic term quickly swamps the linear.
Memory use likely scales at O(N) but if the density of intersections is great, may approach O(N2). If we insist on lower memory demands, we find ourselves reading not only our data file, writing not only a results file; but also reading and writing an intermediate file. The spectre of random access looms. We may knock down memory use and find our script takes far too long to run.
Another traditional solution is the matchmaker or dating service, which maintains a book, a central repository of guys and the days they each have available. Each guy comes into the matchmaker's office, registers his availability, and leaves. After all guys have registered, then each guy returns to the office on a second visit, at which time he picks up his list of prospective matches: other guys who are also interested in meeting on the same days.
Notice that the number of trips to the book is 2N. Even if the cost of a trip to the book is much more expensive than crossing the room at a party, the approach is still O(N); we disregard the constant multiplier.
Let me caution readers against hasty conclusions. Like all models, this one leaves something out; like all demos, this leaves out a great deal. Please let me see if I can clear up some likely sources of confusion.
You might think that there's not much difference between the Party and the Book. The Party may even look more efficient, since there's only one trip to the party per guy. That's not quite the point I'm trying to make.
The Party model is more roughly equivalent to the problem of meeting other guys in a lost age, the one before telephones (not to mention iPhonies). Each guy travels on foot to every other guy's house, knocks on the door, and talks in person (if the other guy is even home). You might get a clearer picture of the Party model if you imagine a convention, such as International Mr. Leather, that draws many thousands of guys to town. If you insist that each attendee be matched to every other by city, fetish, and availability; and you want to do it before the show is over; then you need a central registry.
Sophisticated Monks will note that ultimately, O(N2) comparisons are still done in the Book model (and, with a quibble, in the demo I present shortly). The difference is that these comparisons are all done internally; there is no I/O -- each guy makes two trips to the book and that's it. The book itself is held in memory and it's a significant consideration. But the character of the data set greatly affects book size. I'll go into this in a moment.
Other Monks will likely object to the excessive cost of certain operations. You're probably looking at the wrong things. For any conventional computer program, there are time costs and space costs associated with each operation; and these costs may be further broken down. The time cost is very low for in-memory operations, high for sequential disk I/O, and very high for random-access disk I/O. Data must be kept somewhere; if it is kept in memory then the cost is very much higher than if it is kept on disk. It is even cheaper to offline files onto other media.
For all classes of cost, the nature of the object must be considered. In the gay dating model, the critical object is a short list of integers. This may be stored anywhere cheaply and compared with any other list cheaply. I do this so that the script may be intelligible. You should imagine a more complex data set.
This node was inspired by the plight of a fellow Monk whose bioinformatics data include tens of thousands of sets of about six floating-point numbers each. Comparisons between one set and the next are a matter of simultaneous inequalities. He is hitting a memory ceiling, trying to keep all his data in memory; and running time is, I gather, excessive; it just takes too long to plough through all that data in anything except a streamlined, systematic fashion.
In a given category of cost, any cost that grows only as O(N) is cheap, no matter how expensive it looks on a per-each basis. For large N, O(N2) costs dominate. These O(N2) costs -- in time, space, or both, depending on available resources -- are what you need to minimize.
A naive solution to the gay dating problem might be similar to the Party approach. All guys (data records) are read from disk into memory and comparisons done until the set of guys is exhausted. A naive programmer sees how many comparisons are being made and seeks to improve comparison speed in some way. He is stopped by an out-of-memory error; he reasons that, for each $guy, all other guys must be present in memory, so that each can be compared against that $guy. What to do? Improving a bad algorithm is not a solution.
This script implements the Book approach. Each $guy is read into memory and his availability stored in $book; then he's forgotten. On a second pass, each $guy is again read into memory; this time his availability is compared against the $book and possible @matches extracted.
Although the demo is simple, it is robust. Intricate, expensive comparisons can be done only after simple, cheap comparisons have proven some likelihood of success. You might also like to read through the demo with an eye to solving the gay bathhouse problem: What's the best time for a $guy to visit the bathhouse? In this case, you don't care who is at the bathhouse; you just want to be sure that when $guy goes, he doesn't find himself all alone. Therefore, you don't need to make expensive pairwise comparisons at all; you can just aggregate availability into the $book.
FINAL WARNING: This is a single, simple demonstration of concepts described above. It is the least important part of this node.
This demo is not a ready-made solution to any real data-handling problem. I don't insist on any particular feature of it. I don't assert that it runs precisely in O(N) time or uses O(N) memory. I do suggest that an approach based generally on this may open a door to a more efficient script. You may want to consider lifting some of its features for your own use. This is a toolbox, not an automobile.
I've all but entirely avoided modules that might greatly assist in this demo; its purpose is to illustrate algorithms. I strongly suggest, in any actual work, maximum employment of the vast resources of CPAN.
#!/run/bin/perl # gay-dating.pl # = Copyright 2010 Xiong Changnian <xiong@xuefang.com> = # = Free Software = Artistic License 2.0 = NO WARRANTY = use strict; use warnings; use feature qw( say ); use Perl6::Form; #--------------------------------------------------------------------- +-------# # KEY CONCEPTS # # DATA is read for $guy records; each consists of a $name and a list o +f @days. # A $day is an integer (1..31) on which a guy is free for playtime. # $book accumulates the varied day preferences of all of the guys. # $report records the matches found between guys. # pseudo-globals my $data_start ; # tell the first line of DATA my $book = [ # accumulator; fat boy # AoA: $book->[$day] eq [@names] ]; my $report = {}; # reports by -name # main my $exit = 1; # bash failure, initially $exit = ( main() +1 ) %2; # 1 for 0, 0 for 1 exit($exit); sub main { $data_start = tell DATA; # remember first_pass(); # accumulate wants seek DATA, $data_start, 0; # start over second_pass(); # make dates print_reports(); # see what we got return 1; }; # Load records and accumulate wants # Writes $book # sub first_pass { my $guy ; # current guy my $name ; # current guy's name GETDATA: while (<DATA>) { next GETDATA if ( skip_record($_) ); # comment or blank last GETDATA if ( last_record($_) ); # END $guy = parse_record($_); $name = $guy->{-name}; # for this guy do my @days = @{ $guy->{-days} }; # record wants - not cheating, just for our final output $report->{$name}{-wantdays} = \@days; # accumulate demands foreach my $day (@days) { push @{ $book->[$day] }, $name; }; }; }; # Find out who is compatible # sub second_pass { my $guy ; # current guy my $name ; # current guy's name GETDATA: while (<DATA>) { next GETDATA if ( skip_record($_) ); # comment or blank last GETDATA if ( last_record($_) ); # END $guy = parse_record($_); $name = $guy->{-name}; # for this guy do my @days = @{ $guy->{-days} }; # find hits on days foreach my $today (@days) { my @name_hits_today = @{ $book->[$today] }; @name_hits_today = weed($name, @name_hits_today +); foreach my $name_hit (@name_hits_today) { my $match = { -name => $name_hit, -day => $today, }; push @{ $report->{$name}{-matches} }, $match; }; }; }; }; # Dump out all the reports # sub print_reports { say q{}; say q*| name wantdays + |*; say q*|-------------------------------------------------------------- +-----|*; foreach my $name (sort keys %$report) { my $wantdays = join q{, }, @{ $report->{$name}{-wantdays} } +; print form q*| {<<<<<<<<<<} {<<<<<<<<<<} + |*, $name, $wantdays, ; foreach my $match ( @{ $report->{$name}{-matches} } ) { my $boytoy = $match->{-name}; my $today = $match->{-day}; my $ith = get_ith($today); my $message = qq{$boytoy can meet you on the $today$ith.}; print form q*| {<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<} + |*, $message, ; }; if ( not @{ $report->{$name}{-matches} } ) { my $message = qq{Sorry! No matches.}; print form q*| {<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<} + |*, $message, ; }; say q*| + |*; }; say q*|-------------------------------------------------------------- +-----|*; say q{}; }; # my $guy = parse_record( $string ); # # key meaning example # # -name $guy's name 'Al' # -days days of month available [1, 2] # sub parse_record { my $string = $_[0]; chomp $string ; my @split_white = split ' ', $string; my $name = $split_white[0]; my @days = split /[,|\s]+/, $split_white[1]; my $guy = { -name => $name, -days => \@days, }; return $guy; }; # $bool = skip_record($_) # sub skip_record { local $_ = shift; return 1 if /^[\s]*#/; # comment return 1 if /^[\s]*$/; # essentially blank line return 0; }; # $bool = last_record($_) # sub last_record { local $_ = shift; return 1 if /^__END__/; # end of data; ignore return 0; }; # Convert cardinals to ordinals # sub get_ith { local $_ = shift; return q{th} if /1\d/; # 11th, 12th, 13th return q{st} if /1$/; # 1st, 21st, 31st return q{nd} if /2$/; # 2nd, 22nd return q{rd} if /3$/; # 3rd, 23rd return q{th}; # default }; # Narrow a list of matches according to some hypothetical, expensive c +riteria # We only eliminate masturbation in this demo. # Note that more than one filtering step can be applied here; # the cheaper steps should come first, aborting further processing # sub weed { my $name = shift; my @names = @_; my @not_mes = grep {!/$name/} @names; return @not_mes; }; __DATA__ # name days Al 1,2 Bob 2,13 Chuck 12,4 Dick 3,7,30 Edgar 5,7 Fred 2,23 Greg 4,5,12 Harry 6 Ian 1,20 Jack 4,23 __END__
Output:
| name wantdays | |-------------------------------------------------------------------| | Al 1, 2 | | Ian can meet you on the 1st. | | Bob can meet you on the 2nd. | | Fred can meet you on the 2nd. | | | | Bob 2, 13 | | Al can meet you on the 2nd. | | Fred can meet you on the 2nd. | | | | Chuck 12, 4 | | Greg can meet you on the 12th. | | Greg can meet you on the 4th. | | Jack can meet you on the 4th. | | | | Dick 3, 7, 30 | | Edgar can meet you on the 7th. | | | | Edgar 5, 7 | | Greg can meet you on the 5th. | | Dick can meet you on the 7th. | | | | Fred 2, 23 | | Al can meet you on the 2nd. | | Bob can meet you on the 2nd. | | Jack can meet you on the 23rd. | | | | Greg 4, 5, 12 | | Chuck can meet you on the 4th. | | Jack can meet you on the 4th. | | Edgar can meet you on the 5th. | | Chuck can meet you on the 12th. | | | | Harry 6 | | Sorry! No matches. | | | | Ian 1, 20 | | Al can meet you on the 1st. | | | | Jack 4, 23 | | Chuck can meet you on the 4th. | | Greg can meet you on the 4th. | | Fred can meet you on the 23rd. | | | |-------------------------------------------------------------------|
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |