in reply to Most efficient record selection method?

For example, we have 5 comma separated data file with 20 fields and we have been asked to show all variants contained in each of 3 of the fields over the 5 files in the least number of records.

Variants of what? Field values? Field value combinations? Variants relative to a canonical set of values? This isn't very clear to me.

Also, do you really want the smallest possible sampling of records, or just to get rid of records for which all values have already been seen? Consider the sample [A1,B1],[A1,B2],[A2,B1],[A2,B2]. If you only want to drop records where all values have already been seen then, assuming you don't care about combinations, [A1,B1],[A1,B2],[A2,B1] (a la CountZero's solution) is OK. If you really want the minimum number of records, then the right answer is [A1,B1],[A2,B2], as pointed out by Kyle, and we've got a much more difficult optimization problem.

Best, beth

Update: removed code sample due to its failure to solve the problem raised by Kyle, and added a question about what the OP really means by "least".

Replies are listed 'Best First'.
Re^2: Most efficient record selection method?
by Kraythorne (Sexton) on Feb 13, 2009 at 12:09 UTC
    Ok.

    Variants = the mixture of values contained in each field.

    Sample records = Live records containing these values

    I don't care about the combination, but we do need to have the minimum number of records so:

    [A1,B1],[A1,B2],[A2,B1],[A2,B2] (variants over 2 fields)

    could be resolved in 3 records as you have said:

    record 1 - contains [A1,B1]
    record 2 - contains [A1,B2]
    record 3 - contains [A2,B1]

    HOWEVER, if there are 2 records that contain:

    record 1 - [A1,B1]
    record 2 - [A2,B2]

    Then these need to be identified and used in preference to the first combination, as they give the minimum number of records needed to represent all the variants in each of the fields.

    This gets more complicated as the number of fields and files being interrogated increases.

      Ok. Here is a start on an algorithm. The sample code contains two fake files (represented by arrays of lines) to demonstrate how to maintain state as we pass from one file to the next. It also contains a very simple optimizer that can eliminate lines that add only one new value in favor of lines that add two or more new values at once.
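      The sample code itself is not reproduced here, so the following is a hedged reconstruction of the approach just described: two fake "files", a `seen` set carried from one file to the next, and a simple (and, as noted below, incomplete) pruning pass. It is sketched in Python; the file contents, field indices, and variable names are all invented for illustration.

```python
import csv, io

FILES = {  # two fake "files", as in the description above
    'file1.csv': "A1,B1,x\nA1,B2,x\n",
    'file2.csv': "A2,B1,x\nA2,B2,x\n",
}
FIELDS = [0, 1]  # indices of the fields whose variants must be covered

seen = set()   # (field, value) pairs seen so far, maintained across files
kept = []      # (row, pairs-this-row-introduced) tuples

for name, text in FILES.items():
    for row in csv.reader(io.StringIO(text)):
        new = {(f, row[f]) for f in FIELDS} - seen
        if new:                      # keep only rows that add something
            kept.append((row, new))
            seen |= new

# simple pruning pass: drop a kept row if every pair it contributed
# also appears in some other kept row (this mirrors the "eliminate
# lines that add only one new value" optimizer, and is just as
# incomplete for three or more fields)
for i in range(len(kept) - 1, -1, -1):
    contrib = kept[i][1]
    others = set()
    for j, (r2, _) in enumerate(kept):
        if j != i:
            others |= {(f, r2[f]) for f in FIELDS}
    if contrib <= others:
        del kept[i]

print([row for row, _ in kept])
```

On this toy input the pruning pass happens to reach a two-record sample, but as the next paragraph explains, a heuristic like this is not guaranteed to find the smallest sample in general.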

      However, that optimization is not going to find the actual smallest sample if the number of selected fields is greater than 2. It is also possible that we might find a record with two new values and then later on a record with the same two values plus a new value in a third field. You will need to replace the simple optimizer with a more general purpose optimizer. Alas, coming up with a general solution is more work than I have time for.

      Best, beth

      Update: Fixed small bug