in reply to Most efficient record selection method?

For example, we have 5 comma separated data file with 20 fields and we have been asked to show all variants contained in each of 3 of the fields over the 5 files in the least number of records.

Variants of what? Field values? Field value combinations? Variants relative to a canonical set of values? This isn't very clear to me.

Also, do you really want the smallest possible sampling of records, or just to get rid of records for which all values have already been seen? Consider the sample [A1,B1],[A1,B2],[A2,B1],[A2,B2]. If you only want to drop records where all values have already been seen then, assuming you don't care about combinations, [A1,B1],[A1,B2],[A2,B1] (a la CountZero's solution) is OK. If you really want the minimum number of records, then the right answer is [A1,B1],[A2,B2], as pointed out by Kyle, and we've got a much more difficult optimization problem.

Best, beth

Update: removed code sample due to its failure to solve the problem raised by Kyle, and added a question about what the OP really means by "least".

Replies are listed 'Best First'.
Re^2: Most efficient record selection method?
by Kraythorne (Sexton) on Feb 13, 2009 at 12:09 UTC
    Ok.

    Variants = the mixture of values contained in each field.

    Sample records = Live records containing these values

    I don't care about the combination, but we do need to have the minimum number of records so:

    [A1,B1],[A1,B2],[A2,B1],[A2,B2] (variants over 2 fields)

    could be resolved in 3 records as you have said:

    record 1 - contains [A1,B1]
    record 2 - contains [A1,B2]
    record 3 - contains [A2,B1]

    HOWEVER, if there are 2 records that contain:

    record 1 - [A1,B1]
    record 2 - [A2,B2]

    Then these need to be identified and used in preference to the first combination, as they give the minimum number of records needed to represent all the variants in each of the fields.

    This gets more complicated as the number of fields and files being interrogated increases.

      Ok. Here is a start on an algorithm. The sample code contains two fake files (represented by arrays of lines) to demonstrate how to maintain state as we pass from one file to the next. It also contains a very simple optimizer that can eliminate lines that add only one new value in favor of lines that add two or more new values at once.
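      The sample code itself is not reproduced here, so the following is a hedged reconstruction of the approach just described: two fake "files", a `seen` set carried from one file to the next, and a simple (and, as noted below, incomplete) pruning pass. It is sketched in Python; the file contents, field indices, and variable names are all invented for illustration.

```python
import csv, io

FILES = {  # two fake "files", as in the description above
    'file1.csv': "A1,B1,x\nA1,B2,x\n",
    'file2.csv': "A2,B1,x\nA2,B2,x\n",
}
FIELDS = [0, 1]  # indices of the fields whose variants must be covered

seen = set()   # (field, value) pairs seen so far, maintained across files
kept = []      # (row, pairs-this-row-introduced) tuples

for name, text in FILES.items():
    for row in csv.reader(io.StringIO(text)):
        new = {(f, row[f]) for f in FIELDS} - seen
        if new:                      # keep only rows that add something
            kept.append((row, new))
            seen |= new

# simple pruning pass: drop a kept row if every pair it contributed
# also appears in some other kept row (this mirrors the "eliminate
# lines that add only one new value" optimizer, and is just as
# incomplete for three or more fields)
for i in range(len(kept) - 1, -1, -1):
    contrib = kept[i][1]
    others = set()
    for j, (r2, _) in enumerate(kept):
        if j != i:
            others |= {(f, r2[f]) for f in FIELDS}
    if contrib <= others:
        del kept[i]

print([row for row, _ in kept])
```

On this toy input the pruning pass happens to reach a two-record sample, but as the next paragraph explains, a heuristic like this is not guaranteed to find the smallest sample in general.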

      However, that optimization is not going to find the actual smallest sample if the number of selected fields is greater than 2. It is also possible that we might find a record with two new values and then later on a record with the same two values plus a new value in a third field. You will need to replace the simple optimizer with a more general purpose optimizer. Alas, coming up with a general solution is more work than I have time for.

      Best, beth

      Update: Fixed small bug