Kraythorne has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking to use Perl to help with our data processing, specifically to generate a sample file which shows all field variants across various different files using the least number of records.

For example, we have 5 comma-separated data files with 20 fields, and we have been asked to show all the variants contained in each of 3 of the fields across the 5 files in the least number of records.

I could do with having a program that allows me to define as many fields as I want from any csv database and generate a sample file which shows ALL the variants from those fields in the least number of records.

I'm struggling to get my head around how to code efficiently for this so any help would be most welcome.

Cheers

Kray


Re: Most efficient record selection method?
by kyle (Abbot) on Feb 12, 2009 at 16:16 UTC

    Do you want your "sample" to have mock records that show all field variants, or do you want to sample real records?

    If you're mocking records, read your data files and for each field track each variant. A hash of hashes would work well for this. See perldsc if you're not familiar with the HoH. Then you pick the field with the most variants, and that's how many records you have to have. Loop over them all, generating a record for each. The record generator will look at every other field and pick a variant that hasn't been shown yet or a random one if they've all been shown.
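    As a rough sketch of that bookkeeping (the naive comma split and reading the file names from @ARGV are assumptions for illustration; real data would go through a proper CSV parser):

    use strict;
    use warnings;

    my %variants;    # $variants{$field_index}{$value} = 1
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            my @values = split /,/, $line;    # naive split, stands in for real CSV parsing
            $variants{$_}{ $values[$_] } = 1 for 0 .. $#values;
        }
        close $fh;
    }

    # the field with the most variants dictates how many mock records are needed
    my ($widest) = sort { keys %{ $variants{$b} } <=> keys %{ $variants{$a} } } keys %variants;
    my $records_needed = keys %{ $variants{$widest} };
    print "need at least $records_needed mock records\n";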

    If you're sampling real records, it's more complicated. You still need to have a record of every variant for every field, but for each of those you need a list of the records that provide an example. I would start with records that appear least frequently in the data structure. That is, pick a record that gives an example of a variant that doesn't appear very often. It will also have examples of more common field variants. Upon picking that record, "mark off" the variants that you've exemplified already. Keep picking records that show the least common remaining variant until you've exemplified them all.

    In case it's not obvious, I think the important part of this problem is the data structures you use to store your input data. You want something that will lead you naturally to the records you want.

      It looks like my way will be convoluted (maybe due to not having a very in-depth knowledge of Perl). I'll post my code once it's written and maybe someone can improve it. :-)
      Well, I think I have something that works - I don't really know how efficient it is, so any advice would be most welcome. The Datatools::Parsecsv module provides a simple parse_csv subroutine we use:
      sub parse_csv {
          my $text  = shift;
          my $delim = shift;
          my $type  = undef;
          if ($delim && $delim =~ /comma|tab|pipe|fixed/i){
              ($type, $delim) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms;
          }
          my @new = ();
          if ($delim){
              while ($text =~ m{
                  # the first part groups the phrase inside the quotes.
                  # see explanation of this pattern in MRE
                  "([^\"\\]*(?:\\.[^\"\\]*)*)"$delim?
                  | ([^$delim]+)$delim?
                  | $delim
              }gx){
                  if (defined $+){ push(@new, $+) }
                  else           { push(@new, '') }
              }
              push(@new, '') if substr($text, -1, 1) eq ($delim);
          }
          else{
              $new[0] = $text;
          }
          return @new;    # list of values that were comma-separated
      }


      Anyway, below is the code, which I've tested on a few files, and it appears to function. Please let me know if any of the code could be made more efficient - I have commented the code as much as possible.

      #!c:/usr/bin/perl
      use strict;
      use warnings;
      use Datatools::Parsecsv;

      my @field_selection  = qw(23 24 25 26);              # ARRAY INDEX LOCATIONS
      my @file_list        = ('File1', 'File2', 'File3');  # placeholder file names
      my %field_key        = ();
      my %file_key         = ();
      my $record_key_count = undef;

      #set location of the sample record output file
      my $proof_file = 'K:\Perl Development\Test Files\J02394_proofs.csv';

      foreach my $file (@file_list){
          my $header = 1;
          open (INFILE, "$file") || die "Cannot open file $file for reading: $!\n";
          while (my $record = <INFILE>){
              chomp $record;
              #ignore record header
              if ($header == 1){
                  $header = 0;
                  next;
              }
              #split fields into @fields
              my @fields = parse_csv($record, ',');
              my $record_key = undef;
              #build up concatenated key with pipe delimiter
              #create field_key{field number} -> {field key} = occurrences
              foreach my $field_number (@field_selection){
                  $record_key .= '|' . $fields[$field_number];
                  ${$field_key{$field_number}}{$fields[$field_number]}++;
              }
              #create file_key{filename} -> {concatenated record key} = full record
              ${$file_key{$file}}{$record_key} = $record;
          }
          close INFILE;
      }

      open (PROOF_FILE, ">$proof_file") || die "cannot write to proof file $proof_file: $!\n";

      my $built_key = undef;

      OPTIMISE:
      #generic hash for value-sorting hashes
      my %hash_value_to_sort = ();
      #keep track of smallest field tests
      my %smallest_field = ();
      #keep track of tried keys per field for multiple passes
      my %tried_field = ();
      #keep track of fields that have reverted to global_match
      my %exhausted = ();
      my $match_result = 0;

      #rebuild keys until a match is found
      while ($match_result == 0){
          $built_key    = &build_key();
          $match_result = &check_key_match($built_key);
      }
      goto OPTIMISE;

      sub build_key{
          my $still_keys_left = undef;
          my $appended_keys   = undef;
          #cycle through field selection
          foreach my $field_number (@field_selection){
              #when this field's keys are exhausted, keep global_match
              #until a successful match (%exhausted is reset at OPTIMISE)
              if (exists $exhausted{$field_number}){
                  $appended_keys .= '\|' . 'global_match';
                  next;
              }
              #get a key for each field and build up the final key
              my $found_key = &get_field_key($field_number);
              print "$field_number:" . ${$found_key}{'key'} . "\n";
              $appended_keys .= '\|' . ${$found_key}{'key'};
              #if the field key returns global_match then all keys for this
              #field have been exhausted. No need to check for smallest for
              #this field any more, so clear %smallest_field if it relates
              #to this field
              if (${$found_key}{'key'} =~ m/ global_match /xms){
                  $exhausted{$field_number} = '1';
                  if (exists $smallest_field{$field_number}){
                      %smallest_field = ();
                  }
              }
              #otherwise this field still has keys left
              else{
                  $still_keys_left = 1;
              }
              #keep track of tried keys for this field in case
              #we have multiple passes
              ${$tried_field{$field_number}}{${$found_key}{'key'}} = 1;
              #don't bother defining smallest once this field is
              #exhausted; go to the next field
              if (exists $exhausted{$field_number}){
                  next;
              }
              #1st definition:
              #record the number of occurrences of the found key and
              #flag the field as smallest
              if (not defined $smallest_field{'number'}){
                  $smallest_field{'number'}      = ${$found_key}{'number'};
                  $smallest_field{$field_number} = '1';
              }
              #otherwise check the current number of occurrences for
              #this key and replace smallest if lower; flag the field
              #as smallest
              elsif (${$found_key}{'number'} < $smallest_field{'number'}){
                  $smallest_field{'number'}      = ${$found_key}{'number'};
                  $smallest_field{$field_number} = '1';
              }
          }
          #if no keys left to find, close the proof file and exit
          #the program
          if (not defined $still_keys_left){
              close PROOF_FILE;
              exit;
          }
          #otherwise return the appended key
          return $appended_keys;
      }

      sub get_field_key{
          #field we want to get a key for
          my $field = shift;
          #generic hash for value sorting
          %hash_value_to_sort = %{$field_key{$field}};
          #cycle through the keys in frequency order;
          #this helps to optimise the record selection
          foreach my $key ( sort HashValueSort keys %hash_value_to_sort ){
              #if the field is flagged as smallest occurrence,
              #skip keys that have already been tried
              if (exists $smallest_field{$field}){
                  if (exists ${$tried_field{$field}}{$key}){
                      next;
                  }
              }
              #return key and number of occurrences
              return {'key' => $key, 'number' => $hash_value_to_sort{$key} };
          }
          #if no valid key is available (i.e. all tried or all keys found)
          #return a global match for this field
          return {'key' => 'global_match', 'number' => '0' };
      }

      sub check_key_match{
          my $check_key = shift;
          #substitute with a global pattern match
          $check_key =~ s/\|global_match/\|\.\+/g;
          #run through each file until a record key is found in the file hash:
          #print the matching record from the %file_key hash,
          #delete the matching keys from the %field_key hash,
          #delete the matching record from the %file_key hash
          foreach my $file (@file_list){
              foreach my $key (keys %{$file_key{$file}}){
                  if ($key =~ m/\A $check_key \z/xms){
                      my $not_dupe = &delete_keys_found($key);
                      if ($not_dupe == 1){
                          print PROOF_FILE ${$file_key{$file}}{$key} . "\n";
                          print STDOUT     ${$file_key{$file}}{$key} . "\n";
                          delete ${$file_key{$file}}{$key};
                      }
                      #flag match found
                      return 1;
                  }
              }
          }
          #flag no match found
          return 0;
      }

      sub delete_keys_found{
          my $delete_keys = shift;
          my @delete_keys = split ('\|', $delete_keys);
          my $found_key   = 0;
          #make up any blank trailing fields dropped by the split on '|'
          while ($#delete_keys <= $#field_selection){
              push @delete_keys, '';
          }
          #ignore the empty first index after the split on '|'
          my $fld_num = 1;
          foreach my $field (@field_selection){
              if (exists ${$field_key{$field}}{$delete_keys[$fld_num]}){
                  delete ${$field_key{$field}}{$delete_keys[$fld_num]};
                  $found_key = 1;
              }
              $fld_num++;
          }
          #if no keys were found to delete then all were dupes,
          #so flag this and do not print the record
          return $found_key;
      }

      sub HashValueSort{
          $hash_value_to_sort{$b} <=> $hash_value_to_sort{$a};
      }
      Yup, need to show live records. This is a bit of a brain teaser for me!

        In that case, I think you need two data structures. Here's some half-baked code.

        while ( defined( my $record = <$input_fh> ) ) {
            my %record_hash = rec2hash( $record );
            while ( my ( $field, $value ) = each %record_hash ) {
                my $variant = variant_of( $value );
                $variants_in{ $record }{ $field }{ $variant } = 1;
                push @{ $records_for{ $field }{ $variant } }, $record;
            }
        }

        Here I store, for each field-variant pair, every record that has that pair. Also, for each record, I have every variant that's present in it. I leave it to you to take a record and turn it into field-value pairs, and I also leave it to you to turn a particular value into a "variant" (each value may be a variant, but I can't tell from your description).

        From here you figure out which field-variant pair you want to represent most by looping through the pairs in %records_for and finding which pair has the fewest records available. From that you get a list of records which exemplify that field-variant pair.

        Then look in %variants_in for each record and see which one represents the most different field-variant pairs. That will be one of your examples.

        Then you can delete that record from %variants_in, and you'll want to take every field-variant pair in that record and delete them from %records_for since you no longer need examples of them.

        Then go back and pick another record of interest until there aren't any more.
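        A rough sketch of that selection loop, assuming %records_for and %variants_in have been filled in as in the snippet above (this is only one way to wire it up):

        my @chosen;
        while ( keys %records_for ) {

            # find the field-variant pair with the fewest example records left
            my ( $best_field, $best_variant, $fewest );
            for my $field ( keys %records_for ) {
                for my $variant ( keys %{ $records_for{$field} } ) {
                    my $count = @{ $records_for{$field}{$variant} };
                    ( $best_field, $best_variant, $fewest ) = ( $field, $variant, $count )
                        if !defined $fewest or $count < $fewest;
                }
            }

            # of its example records, keep the one that covers the most remaining pairs
            my %cover;
            for my $record ( @{ $records_for{$best_field}{$best_variant} } ) {
                for my $field ( keys %{ $variants_in{$record} } ) {
                    $cover{$record} += grep { exists $records_for{$field}{$_} }
                        keys %{ $variants_in{$record}{$field} };
                }
            }
            my ($record) = sort { $cover{$b} <=> $cover{$a} } keys %cover;
            push @chosen, $record;

            # every pair exemplified by that record no longer needs an example
            for my $field ( keys %{ $variants_in{$record} } ) {
                delete $records_for{$field}{$_} for keys %{ $variants_in{$record}{$field} };
                delete $records_for{$field} if !keys %{ $records_for{$field} };
            }
            delete $variants_in{$record};
        }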

        Instead of a multi-level hash, it may make it easier to pick some separator and store field-variant pairs that way. That is, you do $records_for{ "$field$separator$variant" } instead of $records_for{ $field }{ $variant }. The problem with this is you have to make sure the separator never appears in your fields or variants.

        Hope this helps.

Re: Most efficient record selection method?
by CountZero (Bishop) on Feb 12, 2009 at 20:26 UTC
    Really, I do not think it has to be more difficult than this:
    use strict;

    my (%first, %second, %third);
    while (<DATA>) {
        my ($first, $second, $third, $fourth) = split ',';
        print unless exists $first{$first}
                 and exists $second{$second}
                 and exists $third{$third};
        $first{$first}++;
        $second{$second}++;
        $third{$third}++;
    }

    __DATA__
    A1, B1, C1, first record
    A2, B2, C2, second record
    A5, B2, C3, third record
    A4, B1, C3, fourth record
    A1, B3, C5, fifth record
    A2, B3, C5, sixth record
    A3, B1, C2, seventh record
    A4, B2, C4, eight record
    A1, B3, C4, nineth record
    A2, B1, C1, tenth record
    Resulting in:
    A1, B1, C1, first record
    A2, B2, C2, second record
    A5, B2, C3, third record
    A4, B1, C3, fourth record
    A1, B3, C5, fifth record
    A3, B1, C2, seventh record
    A4, B2, C4, eight record
    And if you dump the hashes you find how many of each of the variants occurred.

    In a real world application you would parse the CSV data with the usual Text::CSV of course.
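    For instance, a minimal Text::CSV version of the loop above might look like this (the file name and the positions of the three fields are assumptions):

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
    my ( %first, %second, %third );

    open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";
    while ( my $row = $csv->getline($fh) ) {
        my ( $first, $second, $third ) = @{$row}[ 0, 1, 2 ];    # assumed field positions
        print join( ',', @$row ), "\n"
            unless exists $first{$first}
               and exists $second{$second}
               and exists $third{$third};
        $first{$first}++;
        $second{$second}++;
        $third{$third}++;
    }
    close $fh;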

    Update: On reading your question again, I see that I did not take into account the fact that you wanted to have all variants shown in the least number of records. However, I do not think this is something you can solve exactly in any reasonable time; it smells of "Travelling Salesman" to me. But as a gut feeling, I doubt that when working with real-world data there is much to be gained by this optimization, unless you have really strange data.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      If this is your input, you output all four records instead of just the first and last.

      A1, B1, C1, first record
      A1, B2, C1, second record
      A1, B1, C2, third record
      A2, B2, C2, fourth record

      I think you may be right about the travelling salesman problem, though. I thought I had a pretty good way to go in my earlier notes, but now I'm not so sure. Maybe Kraythorne will come back with an implementation and tell us.

        Yes that is right. These algorithms are all extremely sensitive to the order of the data presented to them. You will grant me that your example is a purpose-built "bad" case, hence my comment that with real data, the risk of such edge cases is (much) smaller.

        Any optimization that needs to be done will have to look at all the combinations, which quickly increase in number. Monks more versed in mathematics will correct me if I'm wrong, but I guess it will be a factorial function.

        Borrowing the technique of "Random improvement" and the "inversion" operator (as discussed in Travelling Salesman), I tried this in the following program:

        use strict;

        my $debug = 0;
        my @data  = <DATA>;
        my $count = @data;
        print "Original: $count\n";

        # first run which cannot be worse than the whole file
        my @result = try_me(@data);
        my $result = @result;
        print "Run 1: $result\n";
        print @result;

        #we will unloop $unloop_factor records each time and see if we get a better result
        my $unloop_factor = 3;

        #we will try this $number_of_runs times
        my $number_of_runs = 5000;

        foreach my $number ( 2 .. $number_of_runs + 1 ) {
            my @new_data = unloop( $unloop_factor, @data );
            print "New data:\n @new_data" if $debug;
            my @new_result = try_me(@new_data);
            my $new_result = @new_result;
            print "Run $number: $new_result\n" if $debug;
            if ( $new_result <= $result ) {
                print "New result:\n @new_result" if $debug;
                @data   = @new_data;
                @result = @new_result;
                $result = $new_result;
            } ## end if ( $new_result <= $result)
        } ## end foreach my $number ( 2 .. $number_of_runs...

        print "\nFinal result is: $result\n @result\n";

        sub unloop {
            my ( $unloop_factor, @data ) = @_;
            my $length = @data;
            my $start  = int( rand( $length - $unloop_factor ) );
            print "\nUnloop after $start\n" if $debug;
            my @begin  = @data[ 0 .. $start ];
            my @middle = @data[ $start + 1 .. $start + $unloop_factor ];
            my @end    = ( @data, @data )[ $start + $unloop_factor + 1 .. $length - 1 ];
            return ( @begin, reverse(@middle), @end );
        } ## end sub unloop

        sub try_me {
            my @input = @_;
            my @result;
            my ( %first, %second, %third );
            foreach (@input) {
                my ( $first, $second, $third, $fourth ) = split ',';
                push @result, $_
                    unless exists $first{$first}
                    and exists $second{$second}
                    and exists $third{$third};
                $first{$first}++;
                $second{$second}++;
                $third{$third}++;
            } ## end foreach (@input)
            return @result;
        } ## end sub try_me

        __DATA__
        A1, B1, C1, first record*
        A2, B1, C3, second record
        A3, B1, C2, third record
        A1, B2, C3, fourth record
        A2, B2, C2, fifth record*
        A3, B2, C1, sixth record
        A1, B3, C2, seventh record
        A2, B3, C1, eight record
        A3, B3, C3, nineth record*
        A1, B2, C3, tenth record
        It improves the result, but rarely finds the optimal solution. Perhaps the size of the data is too small and/or the "unloop factor" is not well chosen. The unloop sub is not optimal either, as it will not swap the first record. Comments and improvements are invited!

        CountZero


Re: Most efficient record selection method?
by jethro (Monsignor) on Feb 12, 2009 at 17:03 UTC

    It may be impossible to get an optimal result, but I think this heuristic might be quite successful:

    Have 3 hashes of arrays, keyed by the values of the 3 fields, each holding an array of the CSV lines which have that key. Only the lines which are part of the minimal set are collected in these arrays.

    Loop through the files with a module like Text::CSV or Parse::CSV

    For every line, check whether each of the three field values is already present in its corresponding hash. If any of the three field values is missing, add the line to all 3 hashes (i.e. push the line onto the array at $hash1{$field_value1}, and likewise for the other two).

    If all three field values are already in the hashes, check whether adding this line would allow you to drop two or three other lines from the hashes. Let's call the three values of your line a, b and c. Now $hash1{a} will contain some line, say (a,x,y). Check whether x in hash2 has two lines in its array (it can't be more than 2) and y in hash3 has two lines. If that is the case, you could remove (a,x,y). Do the same with b and c. If you collected more than one line to remove, then do it. There is the complication that you could find the same lines more than once in the three searches, so you have to be careful about the edge cases.
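    A rough sketch of that bookkeeping (the three selected fields are assumed to sit at indices 0, 1 and 2, the file name is a placeholder, and the drop/swap check is only indicated by a comment):

    use strict;
    use warnings;
    use Text::CSV;

    my ( %hash1, %hash2, %hash3 );    # field value => [ lines currently in the minimal set ]
    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

    open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";
    while ( my $row = $csv->getline($fh) ) {
        my ( $v1, $v2, $v3 ) = @{$row}[ 0, 1, 2 ];
        my $line = join ',', @$row;

        if ( !exists $hash1{$v1} or !exists $hash2{$v2} or !exists $hash3{$v3} ) {
            # at least one field value has not been seen yet, so keep the line
            push @{ $hash1{$v1} }, $line;
            push @{ $hash2{$v2} }, $line;
            push @{ $hash3{$v3} }, $line;
        }
        else {
            # all three values are already covered: this is where you would check
            # whether keeping this line instead lets two or three other lines be dropped
        }
    }
    close $fh;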

Re: Most efficient record selection method?
by rir (Vicar) on Feb 18, 2009 at 20:55 UTC
    Update: Major update. I am going to reexamine putting my $HOME under revision control.

    Here is the algorithm.

    • For each value in a field, keep track of the number of occurrences in the input. Set the frequency to zero for a record's values when that record is moved to the output. So:
      $freq->{$fld_no}->{ $record[$fld_num] } = 0;
      is a central fragment. A score of zero means the data field value is no longer needed; a score of one means the value is needed in the results; and a score greater than one means the value is needed but which record will supply it is undetermined. (A minimal sketch of this scoring appears after the list below.)
    • Repeat on the input:
      • Sort it, remove dupes.
      • Move records known to be needed.
      • Delete unneeded records.
      • A record is a duplicate of another if, field by field, either the values are equal or both records' scores for that field are zero.
        Update: That's not so clear; an example (FieldValue:FieldScore):
        Your:0 Horse:4 Blue:0
        My:0 Horse:4 Black:0
        match because the ownerships and colors don't matter.
      • A record is needed if any significant field is unique (has a score equal to one).
      • A record is deletable if one significant field has a score greater than one and the rest are zero. This is just a directly identifiable duplicate.
      • That removes the easy stuff; what is already in the result needs to be added to any final solution of what remains in the input. The rest is brute force, or the long walk.
      • It may be possible to split the input into smaller problems. This would involve grabbing a record and all its sisters and cousins. If we initially select Peace Happiness Joy, then we choose every record with Peace in field one, Happiness in field two, Joy in field three; then we repeat for every chosen record. There's no need to guess at a good starting choice; either there will be partitions or there won't.
      • Examine the head of every permutation of the input set, looking for the ideal solution. The ideal solution's size would be equal to the largest count of differing values for a field. If there is no ideal solution, increase the size of the head until a solution is found. In doing this count the previously created output with the head. In these situations, I tend to look for a solution then for an improvement; it allows quitting with a sub-optimal solution.

        In the above, I write of an ideal solution size. Determining the best possible and worst possible size of our answer allows breaking off processing when it's pointless.

      Possibly change the permutation machine so that each "permutation" array is treated as a wheel, reducing the number of arrays needing to be created. I think the walking is the time sink; not creating the walkways.
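      As a minimal sketch of the scoring in the first bullet (@records and @field_selection are placeholder names for the input records and the chosen field indices):

      my %freq;
      for my $rec (@records) {
          $freq{$_}{ $rec->[$_] }++ for @field_selection;
      }

      for my $rec (@records) {
          my @scores = map { $freq{$_}{ $rec->[$_] } } @field_selection;

          if ( grep { $_ == 1 } @scores ) {
              # some value is unique, so the record is needed: move it to the
              # output and zero the scores of all of its values
              $freq{$_}{ $rec->[$_] } = 0 for @field_selection;
          }
          elsif ( 1 == grep { $_ > 1 } @scores ) {
              # exactly one field is still significant and the rest are zero:
              # a directly identifiable duplicate, so the record can be deleted
          }
      }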

      Here's unguaranteed code.

      Be well,
      rir
      Oops, I posted my code higher up (i.e. completed code that appears to work).

      Update: See the completed code in my reply above, where it is more easily found.

      Just to confirm: it will accept any number of files and any selection of fields, and calculate the minimum number of records required from across all the files to demonstrate the full range of field values for the fields selected (I think!).

      If anyone can pick up on any bugs, please let me know :-)
Re: Most efficient record selection method?
by ELISHEVA (Prior) on Feb 13, 2009 at 09:01 UTC
    For example, we have 5 comma-separated data files with 20 fields, and we have been asked to show all the variants contained in each of 3 of the fields across the 5 files in the least number of records.

    Variants of what? Field values? Field value combinations? Variants relative to a canonical set of values? This isn't very clear to me. Do you mean:

    • The smallest sample of records such that each and every value from field X appears in at least one of the sample records? i.e. combinations of values don't matter. We only require that the full domain of X, the full domain of Y, and the full domain of Z be represented.
    • The smallest sample of records such that each and every canonical value from field X appears in at least one of the sample records? (this would be a smaller set than the previous sample)
    • The smallest sample of records such that all distinct combinations of values from fields X,Y,Z are included?

    Also, do you really want the smallest possible sampling of records, or just to get rid of records for which all values have already been seen? Consider the sample [A1,B1],[A1,B2],[A2,B1],[A2,B2]. If you only want to drop records whose values have all been seen then, assuming you don't care about combinations, [A1,B1],[A1,B2],[A2,B1] (a la CountZero's solution) is OK. If you really want the minimum number of records, then the right solution would be [A1,B1],[A2,B2], as pointed out by Kyle, and we've got a much more difficult optimization problem.

    Best, beth

    Update: removed code sample due to failure to solve the problem raised by Kyle, and added a question about what the OP really means by "least".

      Ok.

      Variants = the mixture of values contained in each field.

      Sample records = Live records containing these values

      I don't care about the combination, but we do need to have the minimum number of records so:

      [A1,B1],[A1,B2],[A2,B1],[A2,B2] (variants over 2 fields)

      could be resolved in 3 records as you have said:

      record 1 - contains [A1,B1]
      record 2 - contains [A1,B2]
      record 3 - contains [A2,B1]

      HOWEVER, if there are 2 records that contain:

      record 1 - [A1,B1]
      record 2 - [A2,B2]

      Then these need to be identified and used in preference to the first combination as that will be the minimum number of records I need to represent all the variants in each of the fields.

      This gets more complicated as the number of fields and files being interrogated increases

        Ok. Here is a start on an algorithm. The sample code contains two fake files (represented by arrays of lines) to demonstrate how to maintain state as we pass from one file to the next. It also contains a very simple optimizer that can eliminate lines that add only one new value in favor of lines that add two or more new values at once.

        However, that optimization is not going to find the actual smallest sample if the number of selected fields is greater than 2. It is also possible that we might find a record with two new values and then later on a record with the same two values plus a new value in a third field. You will need to replace the simple optimizer with a more general purpose optimizer. Alas, coming up with a general solution is more work than I have time for.
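        As an illustration of carrying the "seen" state from one file to the next (the fake file contents and field positions are made up for the sketch, and the preference for lines that add two or more new values at once is only indicated by a comment):

        use strict;
        use warnings;

        # two fake "files", represented by arrays of lines
        my @file_a = ( "A1,B1,C1", "A1,B2,C1", "A2,B1,C2" );
        my @file_b = ( "A2,B2,C2", "A3,B1,C1" );

        my @field_selection = ( 0, 1, 2 );    # placeholder field positions
        my %seen;                             # $seen{$field_index}{$value}
        my @sample;

        for my $lines ( \@file_a, \@file_b ) {    # %seen persists across files
            for my $line (@$lines) {
                my @values = split /,/, $line;    # real data would use Text::CSV
                my @new = grep { !$seen{$_}{ $values[$_] } } @field_selection;
                next unless @new;                 # nothing new in the selected fields
                # a better optimizer would prefer lines where @new >= 2, as noted above
                push @sample, $line;
                $seen{$_}{ $values[$_] } = 1 for @field_selection;
            }
        }

        print "$_\n" for @sample;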

        Best, beth

        Update: Fixed small bug