Let's discuss approaches to finding intersections among a large group of sets. I'll present a thrifty approach that uses much less memory and less time than a naive approach. We'll begin with theory and progress to an executable demo. I assume throughout that the intersections between any two given sets are relatively small and few.

Please load, in another window, the small figures here.

Background

Finding intersections among a group of sets can be abstracted partially to a graph in which each set is represented by a node and each comparison of two sets by an edge. The figure on the left, Party, is the complete graph on 8 nodes, called K₈. It has 28 edges; each node is connected directly to every other. The figure on the right, Book, is a single-level tree of 9 nodes and 9 edges. One of the nodes is colored to indicate its special nature; it might be called the tree's root. (In the demo that follows, essentially every node and edge has a label; but this can be ignored at a high level of abstraction.)

If you're wondering "Why gay dating?", it's because otherwise the model would be a bipartite graph. As in most simple demos, we're ignoring a lot of subtleties.

Algorithmic complexity is often described using big O notation. This is a shorthand method of comparing the performance of algorithms on large data sets; specifically, as the size of the data set increases without bound. Few algorithms are able to do useful work on a data set of N elements in better than O(N) time; but if a way can be found to keep most of the data set on disk, memory usage may well be O(1): constant regardless of set size. An example is finding the sum of a set of integers. Millions of integers might be read from a disk file and each added to an in-memory variable, an accumulator. Each element is discarded after the accumulation step, so memory usage does not scale as O(N). When all elements have been accumulated, the accumulator is printed.
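A minimal sketch of that running-sum example (reading from files named on the command line, or from STDIN):

use strict;
use warnings;

# Sum integers, one per line: O(N) time, O(1) memory.
# Each line is accumulated into $sum and then forgotten.
my $sum = 0;
while ( my $line = <> ) {
    chomp $line;
    $sum += $line;
}
print "$sum\n";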

Very roughly, O(N) performance is generally good; O(Nᶜ) fair for small values of c; while more expensive algorithms are best avoided unless you're sure N will stay small. Of course, if you're able to do better than O(N), as perhaps O(log N), you're doing well indeed. The more data you have, the more important it is to improve your big O.

This node compares approaches in the range ( O(N) .. O(N²) ).

The Model

The model data consists of a number of guys looking to go out with other guys on specific days of the month. Al is only available on the 1st and 2nd of the month, and so on. There might be thousands of guys to consider. How do we hook them up?

Party

One traditional solution is to hold a party. Guys circulate around the room, meeting other guys briefly and comparing their lists of availability. In order to get the complete solution, each guy must speak to every other guy, as indicated by the graph.

In Real Life, the main obstacle to a party is just getting everyone together for one night. But the time it takes for every guy to compare his calendar with every other guy's may not be excessive, since comparisons can take place concurrently.

In a Perl script, we are generally (though not always) limited to linear processing; we must make one comparison after another. For N guys, that's (N² - N) / 2 comparisons; we consider that O(N²) time, since the quadratic term quickly swamps the linear.
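For concreteness, here is a bare, self-contained sketch of that all-pairs loop. The toy data and the simple shared-day test are invented for illustration; this is not the demo script that follows.

use strict;
use warnings;

# The Party approach: every record compared against every other,
# (N² - N) / 2 pairwise tests in all.
my @guys = (
    [ 'Al',   [ 1, 2  ] ],
    [ 'Bob',  [ 2, 13 ] ],
    [ 'Fred', [ 2, 23 ] ],
);

for my $i ( 0 .. $#guys - 1 ) {
    for my $j ( $i + 1 .. $#guys ) {
        my ( $name_i, $days_i ) = @{ $guys[$i] };
        my ( $name_j, $days_j ) = @{ $guys[$j] };
        my %j_has  = map { $_ => 1 } @$days_j;
        my @shared = grep { $j_has{$_} } @$days_i;
        print "$name_i and $name_j share days @shared\n" if @shared;
    }
}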

Memory use likely scales as O(N) but, if the density of intersections is great, may approach O(N²). If we insist on lower memory demands, we find ourselves not only reading our data file and writing a results file, but also reading and writing an intermediate file. The spectre of random access looms. We may knock down memory use only to find our script takes far too long to run.

Book

Another traditional solution is the matchmaker or dating service, which maintains a book, a central repository of guys and the days they each have available. Each guy comes into the matchmaker's office, registers his availability, and leaves. After all guys have registered, each guy returns to the office on a second visit, at which time he picks up his list of prospective matches: other guys who are also interested in meeting on the same days.

Notice that the number of trips to the book is 2N. Even if a trip to the book is much more expensive than crossing the room at a party, the approach is still O(N); we disregard the constant multiplier.

Traps and Confusions

Let me caution readers against hasty conclusions. Like all models, this one leaves something out; like all demos, this one leaves out a great deal. Let me see if I can clear up some likely sources of confusion.

You might think that there's not much difference between the Party and the Book. The Party may even look more efficient, since there's only one trip to the party per guy. That's not quite the point I'm trying to make.

The Party model is closer to the problem of meeting other guys in a lost age, the one before telephones (not to mention iPhonies). Each guy travels on foot to every other guy's house, knocks on the door, and talks in person (if the other guy is even home). You might get a clearer picture of the Party model if you imagine a convention, such as International Mr. Leather, that draws many thousands of guys to town. If you insist that each attendee be matched to every other by city, fetish, and availability, and you want to do it before the show is over, then you need a central registry.

Sophisticated Monks will note that, ultimately, O(N²) comparisons are still done in the Book model (and, with a quibble, in the demo I present shortly). The difference is that these comparisons are all done internally; there is no I/O -- each guy makes two trips to the book and that's it. The book itself is held in memory, and its size is a significant consideration. But the character of the data set greatly affects book size. I'll go into this in a moment.

Other Monks will likely object to the excessive cost of certain operations. You're probably looking at the wrong things. For any conventional computer program, there are time costs and space costs associated with each operation; and these costs may be further broken down. The time cost is very low for in-memory operations, high for sequential disk I/O, and very high for random-access disk I/O. Space costs run the other way: data must be kept somewhere, and keeping it in memory costs very much more than keeping it on disk. It is cheaper still to move files offline onto other media.

For all classes of cost, the nature of the object must be considered. In the gay dating model, the critical object is a short list of integers. This may be stored anywhere cheaply and compared with any other list cheaply. I do this so that the script may be intelligible. You should imagine a more complex data set.

This node was inspired by the plight of a fellow Monk whose bioinformatics data include tens of thousands of sets of about six floating-point numbers each. Comparisons between one set and the next are a matter of simultaneous inequalities. He is hitting a memory ceiling, trying to keep all his data in memory; and running time is, I gather, excessive; it just takes too long to plough through all that data in anything except a streamlined, systematic fashion.
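Purely by way of illustration -- this is my guess at a first pass, not that Monk's actual data or method -- the Book idea can be adapted to such sets by filing each record under a coarse bin key on the first pass, so that the expensive simultaneous-inequality test is attempted only against records sharing a bin on the second pass. Everything below (the record layout, the bin width, the toy values) is invented:

use strict;
use warnings;

my $width = 0.5;            # coarse bin width: an assumption, tune to taste
my %book;                   # bin key => [ record ids ]

# first pass: accumulate record ids per bin (toy records, invented)
my @records = (
    [ 'r1', 1.02, 3.80, 0.60 ],
    [ 'r2', 1.04, 3.90, 0.70 ],
    [ 'r3', 7.80, 2.22, 5.55 ],
);
for my $rec (@records) {
    my ( $id, @floats ) = @$rec;
    my $key = join ',', map { int( $_ / $width ) } @floats;
    push @{ $book{$key} }, $id;
}

# second pass: the expensive comparison is attempted only within a bin
# (a real version would also check neighbouring bins for boundary cases)
for my $key ( sort keys %book ) {
    my @ids = @{ $book{$key} };
    next if @ids < 2;
    print "candidates for full comparison: @ids\n";
}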

In a given category of cost, any cost that grows only as O(N) is cheap, no matter how expensive it looks on a per-each basis. For large N, O(N²) costs dominate. These O(N²) costs -- in time, space, or both, depending on available resources -- are what you need to minimize.

Demo

A naive solution to the gay dating problem might be similar to the Party approach. All guys (data records) are read from disk into memory and comparisons done until the set of guys is exhausted. A naive programmer sees how many comparisons are being made and seeks to improve comparison speed in some way. He is stopped by an out-of-memory error; he reasons that, for each $guy, all other guys must be present in memory, so that each can be compared against that $guy. What to do? Improving a bad algorithm is not a solution.

This script implements the Book approach. Each $guy is read into memory and his availability stored in $book; then he's forgotten. On a second pass, each $guy is again read into memory; this time his availability is compared against the $book and possible @matches extracted.

Although the demo is simple, it is robust. Intricate, expensive comparisons can be done only after simple, cheap comparisons have proven some likelihood of success. You might also like to read through the demo with an eye to solving the gay bathhouse problem: What's the best time for a $guy to visit the bathhouse? In this case, you don't care who is at the bathhouse; you just want to be sure that when $guy goes, he doesn't find himself all alone. Therefore, you don't need to make expensive pairwise comparisons at all; you can just aggregate availability into the $book.
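Here is a minimal sketch of that bathhouse variant, separate from the demo script below (the record format matches the demo's __DATA__ section, trimmed to three guys):

use strict;
use warnings;

# Aggregate availability only: no pairwise comparisons at all.
my @count;                              # $count[$day] = guys free on that day
while (<DATA>) {
    next if /^\s*#/ or /^\s*$/;         # skip comments and blank lines
    my ( $name, $days ) = split ' ', $_;
    $count[$_]++ for split /,/, $days;  # accumulate into the book
}
my ($best) = sort { $count[$b] <=> $count[$a] } grep { $count[$_] } 1 .. 31;
print "Busiest day of the month: the $best.\n";

__DATA__
Al      1,2
Bob     2,13
Fred    2,23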

FINAL WARNING: This is a single, simple demonstration of concepts described above. It is the least important part of this node.

This demo is not a ready-made solution to any real data-handling problem. I don't insist on any particular feature of it. I don't assert that it runs precisely in O(N) time or uses O(N) memory. I do suggest that an approach based generally on this may open a door to a more efficient script. You may want to consider lifting some of its features for your own use. This is a toolbox, not an automobile.

I've all but entirely avoided modules that might greatly assist in this demo; its purpose is to illustrate algorithms. I strongly suggest, in any actual work, maximum employment of the vast resources of CPAN.

#!/run/bin/perl
# gay-dating.pl
# = Copyright 2010 Xiong Changnian <xiong@xuefang.com> =
# = Free Software = Artistic License 2.0 = NO WARRANTY =
use strict;
use warnings;
use feature qw( say );
use Perl6::Form;

#-----------------------------------------------------------------------------#
# KEY CONCEPTS
#
# DATA is read for $guy records; each consists of a $name and a list of @days.
# A $day is an integer (1..31) on which a guy is free for playtime.
# $book accumulates the varied day preferences of all of the guys.
# $report records the matches found between guys.

# pseudo-globals
my $data_start  ;       # tell the first line of DATA
my $book        = [     # accumulator; fat boy
                        # AoA: $book->[$day] eq [@names]
];
my $report      = {};   # reports by -name

# main
my $exit    = 1;                    # bash failure, initially
$exit       = ( main() +1 ) %2;     # 1 for 0, 0 for 1
exit($exit);

sub main {
    $data_start = tell DATA;        # remember
    first_pass();                   # accumulate wants
    seek DATA, $data_start, 0;      # start over
    second_pass();                  # make dates
    print_reports();                # see what we got
    return 1;
};

# Load records and accumulate wants
# Writes $book
#
sub first_pass {
    my $guy     ;                   # current guy
    my $name    ;                   # current guy's name

    GETDATA:
    while (<DATA>) {
        next GETDATA if ( skip_record($_) );    # comment or blank
        last GETDATA if ( last_record($_) );    # END
        $guy    = parse_record($_);
        $name   = $guy->{-name};

        # for this guy do
        my @days    = @{ $guy->{-days} };

        # record wants - not cheating, just for our final output
        $report->{$name}{-wantdays} = \@days;

        # accumulate demands
        foreach my $day (@days) {
            push @{ $book->[$day] }, $name;
        };
    };
};

# Find out who is compatible
#
sub second_pass {
    my $guy     ;                   # current guy
    my $name    ;                   # current guy's name

    GETDATA:
    while (<DATA>) {
        next GETDATA if ( skip_record($_) );    # comment or blank
        last GETDATA if ( last_record($_) );    # END
        $guy    = parse_record($_);
        $name   = $guy->{-name};

        # for this guy do
        my @days    = @{ $guy->{-days} };

        # find hits on days
        foreach my $today (@days) {
            my @name_hits_today = @{ $book->[$today] };
            @name_hits_today    = weed($name, @name_hits_today);
            foreach my $name_hit (@name_hits_today) {
                my $match   = {
                    -name   => $name_hit,
                    -day    => $today,
                };
                push @{ $report->{$name}{-matches} }, $match;
            };
        };
    };
};

# Dump out all the reports
#
sub print_reports {
    say q{};
    say q*| name          wantdays                                            |*;
    say q*|-------------------------------------------------------------------|*;
    foreach my $name (sort keys %$report) {
        my $wantdays    = join q{, }, @{ $report->{$name}{-wantdays} };
        print form
            q*| {<<<<<<<<<<}  {<<<<<<<<<<}                                        |*,
                $name,          $wantdays,
            ;
        foreach my $match ( @{ $report->{$name}{-matches} } ) {
            my $boytoy  = $match->{-name};
            my $today   = $match->{-day};
            my $ith     = get_ith($today);
            my $message = qq{$boytoy can meet you on the $today$ith.};
            print form
                q*|     {<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<}             |*,
                        $message,
                ;
        };
        if ( not @{ $report->{$name}{-matches} } ) {
            my $message = qq{Sorry! No matches.};
            print form
                q*|     {<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<}             |*,
                        $message,
                ;
        };
        say q*|                                                                   |*;
    };
    say q*|-------------------------------------------------------------------|*;
    say q{};
};

# my $guy   = parse_record( $string );
#
#   key         meaning                     example
#
#   -name       $guy's name                 'Al'
#   -days       days of month available     [1, 2]
#
sub parse_record {
    my $string          = $_[0];
    chomp $string       ;
    my @split_white     = split ' ', $string;
    my $name            = $split_white[0];
    my @days            = split /[,|\s]+/, $split_white[1];

    my $guy     = {
        -name   => $name,
        -days   => \@days,
    };
    return $guy;
};

# $bool     = skip_record($_)
#
sub skip_record {
    local $_ = shift;
    return 1 if /^[\s]*#/;      # comment
    return 1 if /^[\s]*$/;      # essentially blank line
    return 0;
};

# $bool     = last_record($_)
#
sub last_record {
    local $_ = shift;
    return 1 if /^__END__/;     # end of data; ignore
    return 0;
};

# Convert cardinals to ordinals
#
sub get_ith {
    local $_ = shift;
    return q{th} if /1\d/;      # 11th, 12th, 13th
    return q{st} if /1$/;       # 1st, 21st, 31st
    return q{nd} if /2$/;       # 2nd, 22nd
    return q{rd} if /3$/;       # 3rd, 23rd
    return q{th};               # default
};

# Narrow a list of matches according to some hypothetical, expensive criteria
# We only eliminate masturbation in this demo.
# Note that more than one filtering step can be applied here;
# the cheaper steps should come first, aborting further processing
#
sub weed {
    my $name    = shift;
    my @names   = @_;
    my @not_mes = grep {!/$name/} @names;
    return @not_mes;
};

__DATA__
# name      days
Al          1,2
Bob         2,13
Chuck       12,4
Dick        3,7,30
Edgar       5,7
Fred        2,23
Greg        4,5,12
Harry       6
Ian         1,20
Jack        4,23
__END__

Output:

| name          wantdays                                            |
|-------------------------------------------------------------------|
| Al            1, 2                                                |
|     Ian can meet you on the 1st.                                  |
|     Bob can meet you on the 2nd.                                  |
|     Fred can meet you on the 2nd.                                 |
|                                                                   |
| Bob           2, 13                                               |
|     Al can meet you on the 2nd.                                   |
|     Fred can meet you on the 2nd.                                 |
|                                                                   |
| Chuck         12, 4                                               |
|     Greg can meet you on the 12th.                                |
|     Greg can meet you on the 4th.                                 |
|     Jack can meet you on the 4th.                                 |
|                                                                   |
| Dick          3, 7, 30                                            |
|     Edgar can meet you on the 7th.                                |
|                                                                   |
| Edgar         5, 7                                                |
|     Greg can meet you on the 5th.                                 |
|     Dick can meet you on the 7th.                                 |
|                                                                   |
| Fred          2, 23                                               |
|     Al can meet you on the 2nd.                                   |
|     Bob can meet you on the 2nd.                                  |
|     Jack can meet you on the 23rd.                                |
|                                                                   |
| Greg          4, 5, 12                                            |
|     Chuck can meet you on the 4th.                                |
|     Jack can meet you on the 4th.                                 |
|     Edgar can meet you on the 5th.                                |
|     Chuck can meet you on the 12th.                               |
|                                                                   |
| Harry         6                                                   |
|     Sorry! No matches.                                            |
|                                                                   |
| Ian           1, 20                                               |
|     Al can meet you on the 1st.                                   |
|                                                                   |
| Jack          4, 23                                               |
|     Chuck can meet you on the 4th.                                |
|     Greg can meet you on the 4th.                                 |
|     Fred can meet you on the 23rd.                                |
|                                                                   |
|-------------------------------------------------------------------|

2010-08-11:

Improved adjective.
Note: Contents may change.

Re: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by JavaFan (Canon) on Aug 11, 2010 at 15:18 UTC
    The first thing I noticed about your post was the statement:
    This node compares approaches in the range ( O(N) .. O(N²) ).
    and I thought WTF? It's a statement that doesn't make sense.

    You do not seem to realize that expressing a complexity in "big-Oh" notation doesn't give you more information than an asymptotic upper bound. Stating an algorithm has an O(N²) run time complexity doesn't mean the algorithm actually grows quadratically. Searching in an unsorted array, for instance, has a run-time complexity of O(N²). It also has a run-time complexity of O(N).

    So, I read your post more carefully. You're using a lot of woolly words, but it doesn't seem to be more than a logical expansion of how the FAQ tells you how to intersect arrays. Your memory requirements seem to be Ω(dN), where N is the number of people involved, and d the number of days to be considered. No naïve algorithm would use more. What I don't understand is why you need to rescan the file a second time - you do store all the names anyway.

    For the amount of text and code, I'm not really impressed. Here's a program that does essentially the same (it doesn't print a bounding box). About half of the code has to do with formatting, not with actual calculations. Half of the remaining half is used for reading the data.

    use 5.010;
    use warnings;
    use strict;

    my (@book, @people);

    while (<DATA>) {
        chomp;
        next if /^#/ || !/\S/;
        my ($name, $days) = split;
        my @days = split /,/, $days;
        foreach my $day (@days) {
            push @{$book[$day]}, $name;
        }
        push @people, [$name, \@days];
    }

    foreach my $info (@people) {
        my ($name, $days) = @$info;
        my %matches;
        foreach my $day (@$days) {
            my @match = grep {$_ ne $name} @{$book[$day]} or next;
            $matches{$day} = \@match if @match;
        }
        show($name, $days, \%matches);
    }

    sub show {
        my ($name, $days, $matches) = @_;
        printf "%-20s %s\n", $name, join ", ", @$days;
        unless (%$matches) {
            print "    Sorry! No matches.\n";
        }
        else {
            foreach my $day (@$days) {
                foreach my $name (@{$$matches{$day} || []}) {
                    print "    $name can meet you on the $day";
                    given ($day) {
                        when (/1[0-9]$/) {print "th"}
                        when (/1$/)      {print "st"}
                        when (/2$/)      {print "nd"}
                        when (/3$/)      {print "rd"}
                        default          {print "th"}
                    }
                    print ".\n";
                }
            }
        }
        print "\n";
    }

    __DATA__
    # name      days
    Al          1,2
    Bob         2,13
    Chuck       12,4
    Dick        3,7,30
    Edgar       5,7
    Fred        2,23
    Greg        4,5,12
    Harry       6
    Ian         1,20
    Jack        4,23

      A range of options between O(N) and O(N²) makes perfect sense to me. As long as you're not trying to mentally apply Perl's range operator that is. :)

      It simply means you've got a choice between, for example: O(N) memory with a huge constant, O(N log N) with a modest constant, or O(N²) with a small constant. Which one you should use depends on the size of N.

      When you bring in different independent costs, you could also be considering O(N) CPU and O(N log N) memory, vs O(N log N) CPU and O(N) memory. If you care to think of it this way, there is a multidimensional continuum of efficiencies between O(N) and O(N²) to consider.


      BTW, I don't know of anybody who bothers to be technical enough to use "Omega" instead of just saying Big-Oh. Except you I suppose.

      In practice, I've found it to always be a "Big-Crossed-Fingers" aka "Best Guess at Omega But Not Worth Strictly Proving; Nobody Wants The Cost To Make It Rigorous."

      Few people would know what you're talking about if you said Big-Omega, instead of Big-Oh. Big-Oh gets used as a generic term whenever there are non-mathies around (which is always unless you're still a math student). Bit of a self-sustaining "problem" but as far as I can tell, nobody really cares.

        It simply means you've got a choice between, for example: O(N) memory with a huge constant, O(N log N) with a modest constant, or O(N²) with a small constant. Which one you should use depends on the size of N.
        That doesn't make any sense.
        When you bring in different independent costs, you could also be considering O(N) CPU and O(N log N) memory
        I presume you mean "consider a program that manages to fill supra-linear memory in linear time". Which means, even with removing the big-Oh rubbish, you still don't make sense. Since you can only write to a fixed amount of memory in a fixed time, you cannot use more than a linear amount of memory if your program runs in linear time.
        I don't know of anybody who bothers to be technical enough to use "Omega" instead of just saying Big-Oh. Except you I suppose.
        The difference between "Omega" and "Big-Oh" isn't an obscure technicality. It's as big as the difference between < and >. Or is that a difference you usually cannot be bothered about either?

      Big O notation is a little woolly all by itself. I think you know this, so I won't drive the point into the ground. For some value of "faster" (where faster is not better), an O(N²)-bounded algorithm grows faster than an O(N)-bounded one. That doesn't ensure the latter is always preferred; it only indicates.

      I stated a range to exclude the possibility of better (for some value of "better") algorithms than O(N) (or worse than O(N²)). This is not a sophisticated or advanced node. This is a pretty basic introduction to thinking about how to attack a problem set.

      As stated clearly in my FINAL WARNING, the script shown is not any sort of solution to the bioinformatics exercise, here only hinted at. It's a demonstration of technique, only loosely related to the preceding essay. There are only ten simple records; almost anything would work. If we wait, somebody may golf it down to a one-liner.

      The difference between my code and your code is that, as the data set gets larger and the problem set more complex, my editing may go a bit more smoothly. I don't doubt that you'll wind up with a really efficient script; you're good. But as the stress piles on, I'll accommodate it; my pieces are flexible and modular (in the wider sense). To put it another way, I would fear to maintain your code, while you would only hate to maintain mine.

      Note: Contents may change.
        For some value of "faster" (where faster is not better), an O(N²)-bounded algorithm grows faster than an O(N)-bounded one.
        This shows you don't know much about big-Oh.

        Any algorithm whose running time is in "O(N)" (for N towards infinity) is also in "O(N²)".

        And big-Oh notation is not "woolly". In fact, if the running time of an algorithm is expressed as T(N), with N the size of the input, and it's said that T(N) is in O(f(N)), it means that:

            ∃ M, c > 0: ∀ m > M: T(m) ≤ c·f(m)
        
        The difference between my code and your code is that, as the data set gets larger and the problem set more complex, my editing may go a bit more smoothly.
        And you base this statement on what? (And what has editing to do with the size of the data set?)
        To put it another way, I would fear to maintain your code, while you would only hate to maintain mine.
        I don't know what to think of a programmer that fears to maintain code that consists of two small, self contained blocks, and a small subroutine.
Re: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by BrowserUk (Patriarch) on Aug 11, 2010 at 14:53 UTC

    Why do you make a second pass of the file, when you have stored in memory, all the information to make that second pass, during the first pass?

Re: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by BrowserUk (Patriarch) on Aug 11, 2010 at 18:24 UTC

    Hm. By way of experiment, I tweaked your code to read from a file: open DATA, '<', $ARGV[ 0 ] or die $!;

    I then generated some random data:

    c:\test>head 854367-11m.dat
    aaaaa 5,12,11,15
    aaaab 23,26,12,27,17
    aaaac 18,1,5
    aaaad 20
    aaaae 29,6,6,31,4
    aaaaf 27,11,29,4,3
    aaaag 12,4,6,28
    aaaah 19,29,18,26,21
    aaaai 11,22,6
    aaaaj 8

    c:\test>tail 854367-11m.dat
    zzzyv 26
    zzzyw 7
    zzzyx 29,17,14
    zzzyy 26,6,24,21,3,25
    zzzyz 9,24,30
    zzzza 5,26,3,17,28
    zzzzb 16,1,6,14,9,9
    zzzzc 22,2,1,28,7,21
    zzzzd 17
    zzzze 2

    11/08/2010  18:23             1,787 854367-100.dat
    11/08/2010  18:15            17,311 854367-1000.dat
    11/08/2010  18:15           174,511 854367-10000.dat
    11/08/2010  18:15         1,747,330 854367-100000.dat
    11/08/2010  18:15       207,709,322 854367-11m.dat
    11/08/2010  18:23             3,532 854367-200.dat
    11/08/2010  18:36            34,777 854367-2000.dat
    11/08/2010  19:02            52,181 854367-3000.dat

    And ran a few smaller sets:

    c:\test>854367 854367-1000.dat >nul
    Check mem, cpu & time: Terminating on signal SIGINT(2)

    c:\test>854367 854367-2000.dat >nul
    Check mem, cpu & time: Terminating on signal SIGINT(2)

    c:\test>854367 854367-3000.dat >nul
    Check mem, cpu & time: Terminating on signal SIGINT(2)

    The results:

        N      Mem(MB)     Time(secs)     CPU(billion cycles)
     1000         178           267                  622
     2000         678          1105                 2671
     3000        1501          2595                 5970    ## Updated with actual timings.
    10000       14000      n/avail.             n/avail.

    What do you reckon: O(N²) for time and O(N³) memory?

Re: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by Jenda (Abbot) on Aug 11, 2010 at 21:35 UTC

    Seems to me this long thread could be summed up as a simple piece of advice ... if you're running out of memory, check whether you really need to keep all the information in memory at all times, or whether you could instead extract just a small subset of the data of each "object" and keep just that.

    Let me give you an example. Suppose you are looking for duplicate files in a directory structure. Will you keep all the contents of all those files in memory? Of course not! You'll compute a hash and store just the hash (and maybe the file size) in memory and compare that. Is that rocket science? Do you need to talk about the O() notation? In your case it's actually even simpler: you do not compute some value from the whole object, you just extract a few of its attributes.
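    A rough sketch of that idea, using only the core File::Find and Digest::MD5 modules (just the shape of it, not a polished tool):

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    # Keep only a size+digest signature per file, never the contents.
    my %seen;                   # "size:digest" => first path seen with it
    my $dir = shift @ARGV or die "usage: $0 DIR\n";

    find( sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $key = ( -s $fh ) . ':' . Digest::MD5->new->addfile($fh)->hexdigest;
        if ( exists $seen{$key} ) {
            print "$File::Find::name duplicates $seen{$key}\n";
        }
        else {
            $seen{$key} = $File::Find::name;
        }
    }, $dir );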

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by Xiong (Hermit) on Aug 12, 2010 at 06:33 UTC

    To answer directly the first question asked, I made a second pass to demonstrate the feasibility of making a second pass through a set of data records stored on disk. This is one of the primary purposes of the demo. Another is the technique of operating on each record, accumulating results, then discarding the record before operating on the next. The fact that these techniques, and others like them in the demo, are unnecessary or even stupid when applied to the model problem is wholly irrelevant.

    I fear to maintain code that contains no comments or structure. This is one of the points I hope the little demo illustrates. Patently, as the size of the data set increases, a script's inefficiencies come to light and the code must be refactored. As the complexity of the problem set increases, the script must be rewritten and extended. Structure improves the chance that something can be retained.

    To put it yet another way, my code is terrible but can be improved upon easily. JavaFan's code is perfect, which is good, since I don't think I could improve upon it at all.

    Any expression of a concrete computer programming problem, consisting entirely of a set of pure, precise statements in set theory, is almost certainly of no practical application. To clarify, for the benefit of my intended audience:

    Given a large data set of N elements; if you have a choice between (doing a fixed number C of operations on each element of N), and (doing a single operation on each combination of K elements of N); then you should seriously consider the former alternative, no matter how large you think C may be. If K is 2, it's a good thought; if K is greater than 2, the thought is mandatory.
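    To put rough numbers on it: with N = 10,000 and a generous C = 100, the first alternative costs about 10⁶ operations; pairwise comparison (K = 2) costs N(N-1)/2, roughly 5 × 10⁷; and triples (K = 3) cost roughly 1.7 × 10¹¹.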

    Again, if you have a large data set, limited memory, and a need for some speed, you should seriously consider any approach that lets you read data records sequentially and discard each one after operating on it, even if that means a second or third read-and-operate pass through the same data.

    "[I]f you're running out of memory, check whether you really need to keep all the information in memory at all times..." certainly is part of one of the things I'd hoped to convey in OP.

    I don't need to talk about big O; I thought it would be good to mention. It's one of those topics upon which each successive wizard insists he can make a more definitive statement than all others. This is the reason why I needed to link to the 7th section of the Wikipedia article: Almost everything else in there represents the struggle of various editors to be righter than the last, while the reader suffocates in formalism. Around the watercooler, nobody distinguishes among O(N), Ω(N), Θ(N), or their juniors. Of course, the guys are all wrong. O gets the nod from the watercooler crowd, and in the boardroom, because it speaks to actual real world considerations of fear and greed; and because "Oh" is easier to say than "Omega" or "Theta".

    I don't say a big O proves anything; it may be a partial description of a class of approaches. Like the little figures I drew to illustrate my OP, big O (to me) is best used as a way of organizing thought; a broad outline rather than a fine specification. Ways of thinking about how are some of the contents of OP.

    For a short introduction to these notations, try Wikibooks.

    "My case" does not exist. The specific data given in and manipulated by the demo script is not even intended to be a case; nor is its algorithm, nor its output. The topic of OP is not really gay dating; the model is not even an exemplar of a class of problems. The model is a tiny toy used to give other elements of OP a concrete object. Benchmarking the demo is ludicrous.

    OP offers a set of tools to work on classes of problems which are much more difficult to attack than the model I give -- sufficiently difficult that I did not even attempt to sketch one of them in the compass of a node or a day. These tools are not neatly organized or classified. This is the equivalent of a junk drawer. I offer them freely to whoever might poke through them and take what he likes.

    He may be offended, puzzled, confused, or made contemptuous by the offer; but I'm a bit put out when someone rattles something out of the drawer and complains it just doesn't work. None of it just works; that's why it's junk. You gotta put it together.

    Of all the tools I have ever had available to me in the lab, my junk drawer is the second-most indispensable. (Mine is now, in fact, about a half-dozen large file boxes of junk accumulated over the years.) I can do without the shiny new data analyzer or scope, the sensitive chart recorder, the CNC mill, the ROM emulator, the grounding strap, the feeler gauges, wrenches, hammers, pliers, NPT taps and dies; even the soldering iron, the multimeter, and the roll of black tape. But I cannot create any large, complex, intricate hardware project without a battered flat-tip screwdriver, a junk drawer, and the back of an envelope.

    Note: Contents may change.
      To answer directly the first question asked, I made a second pass to demonstrate the feasibility of making a second pass through a set of data records stored on disk.

      Oooh. You can read a file twice. N't alota pe'ple know tha'.

      The point is, regardless of the "model problem" you chose,

      • if you do not store the data from the first pass, then the second pass is useless because you have nothing from the first pass to compare against.
      • And if you do store the data from the first pass, the second pass is unnecessary and pointless.

      As for your clarification, it's meaningless.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      I'd rather maintain, modify and improve upon JavaFan's code than yours. The fact that you split the code into several subroutines with long names doesn't by itself mean the code is better or "has structure".

      Both your code and your posts are long, long, long. That's good for dating, but bad for both code and technical discussion. Let me quote for you a few sentences from "the little book": "A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all sentences short or avoid all detail and treat subjects only in outline, but that every word tell."

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.