in reply to Gay Dating; or, Thrifty Set Intersections with an Accumulator
To answer directly the first question asked, I made a second pass to demonstrate the feasibility of making a second pass through a set of data records stored on disk. This is one of the primary purposes of the demo. Another is the technique of operating on each record, accumulating results, then discarding the record before operating on the next. The fact that these techniques, and others like them in the demo, are unnecessary or even stupid when applied to the model problem is wholly irrelevant.
I fear to maintain code that contains no comments or structure. This is one of the points I hope the little demo illustrates. Patently, as the size of the data set increases, a script's inefficiencies come to light and the code must be refactored. As the complexity of the problem set increases, the script must be rewritten and extended. Structure improves the chance that something can be retained.
To put it yet another way, my code is terrible but can be improved upon easily. JavaFan's code is perfect, which is good, since I don't think I could improve upon it at all.
Any expression of a concrete computer programming problem, consisting entirely of a set of pure, precise statements in set theory, is almost certainly of no practical application. To clarify, for the benefit of my intended audience:
Given a large data set of N elements; if you have a choice between (doing a fixed number C of operations on each element of N), and (doing a single operation on each combination of K elements of N); then you should seriously consider the former alternative, no matter how large you think C may be. If K is 2, it's a good thought; if K is greater than 2, the thought is mandatory.
Again, if you have a large data set, limited memory, and a need for some speed, you should seriously consider any approach that lets you read data records sequentially and discard each one after operating on it, even if that means a second or third read-and-operate pass through the same data.
"[I]f you're running out of memory, check whether you really need to keep all the information in memory at all times..." certainly is part of one of the things I'd hoped to convey in OP.
I don't need to talk about big O; I thought it would be good to mention. It's one of those topics upon which each successive wizard insists he can make a more definitive statement than all others. This is the reason why I needed to link to the 7th section of the Wikipedia article: Almost everything else in there represents the struggle of various editors to be righter than the last, while the reader suffocates in formalism. Around the watercooler, nobody distinguishes among O(N), Ω(N), Θ(N), or their juniors. Of course, the guys are all wrong. O gets the nod from the watercooler crowd, and in the boardroom, because it speaks to actual real world considerations of fear and greed; and because "Oh" is easier to say than "Omega" or "Theta".
I don't say a O proves anything; it may be a partial description of a class of approaches. Like the little figures I drew to illustrate my OP, big O (to me) is best used as a way of organizing thought; a broad outline rather than fine specification. Ways of thinking about how are some of the contents of OP.
For a short introduction to these notations, try Wikibooks.
"My case" does not exist. The specific data given in and manipulated by the demo script is not even intended to be a case; nor is its algorithm, nor its output. The topic of OP is not really gay dating; the model is not even an exemplar of a class of problems. The model is a tiny toy used to give other elements of OP a concrete object. Benchmarking the demo is ludicrous.
OP offers a set of tools to work on classes of problems which are much more difficult to attack than the model I give -- sufficiently difficult that I did not even attempt to sketch one of them in the compass of a node or a day. These tools are not neatly organized or classified. This is the equivalent of a junk drawer. I offer them freely to whoever might poke through them and take what he likes.
He may be offended, puzzled, confused, or made contemptuous by the offer; but I'm a bit put out when someone rattles something out of the drawer and complains it just doesn't work. None of it just works; that's why it's junk. You gotta put it together.
Of all the tools I have ever had available to me in the lab, my junk drawer is the second-most indispensable. (Mine is now, in fact, about a half-dozen large file boxes of junk accumulated over the years.) I can do without the shiny new data analyzer or scope, the sensitive chart recorder, the CNC mill, the ROM emulator, the grounding strap, the feeler gauges, wrenches, hammers, pliers, NPT taps and dies; even the soldering iron, the multimeter, and the roll of black tape. But I cannot create any large, complex, intricate hardware project without a battered flat-tip screwdriver, a junk drawer, and the back of an envelope.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by BrowserUk (Patriarch) on Aug 12, 2010 at 13:34 UTC | |
|
Re^2: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by Jenda (Abbot) on Aug 12, 2010 at 23:19 UTC |