Re: Most efficient record selection method?
by kyle (Abbot) on Feb 12, 2009 at 16:16 UTC
|
Do you want your "sample" to have mock records that show all field variants, or do you want to sample real records?
If you're mocking records, read your data files and for each field track each variant. A hash of hashes would work well for this. See perldsc if you're not familiar with the HoH. Then you pick the field with the most variants, and that's how many records you have to have. Loop over them all, generating a record for each. The record generator will look at every other field and pick a variant that hasn't been shown yet or a random one if they've all been shown.
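A rough, untested sketch of that mock-record approach; the split on commas stands in for a real CSV parser, and the __DATA__ block is only there to make the sketch self-contained:
use strict;
use warnings;
use List::Util qw(max);

# Track every variant seen for every field: $seen{$field_index}{$value}++
my %seen;
while ( my $line = <DATA> ) {
    chomp $line;
    my @fields = split /,/, $line;    # stand-in for a real CSV parser
    $seen{$_}{ $fields[$_] }++ for 0 .. $#fields;
}

# The field with the most variants dictates how many mock records we need.
my $rows_needed = max map { scalar keys %{ $seen{$_} } } keys %seen;

# Hand out each field's variants in turn; once a field runs out, fall back
# to a random variant so every mock record is fully populated.
my %queue = map { $_ => [ keys %{ $seen{$_} } ] } keys %seen;
for my $row ( 1 .. $rows_needed ) {
    my @mock = map {
        my @variants = keys %{ $seen{$_} };
        @{ $queue{$_} } ? shift @{ $queue{$_} } : $variants[ rand @variants ];
    } sort { $a <=> $b } keys %seen;
    print join( ',', @mock ), "\n";
}
__DATA__
A1,B1,C1
A2,B2,C1
A3,B1,C2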
If you're sampling real records, it's more complicated. You still need to have a record of every variant for every field, but for each of those you need a list of the records that provide an example. I would start with records that appear least frequently in the data structure. That is, pick a record that gives an example of a variant that doesn't appear very often. It will also have examples of more common field variants. Upon picking that record, "mark off" the variants that you've exemplified already. Keep picking records that show the least common remaining variant until you've exemplified them all.
In case it's not obvious, I think the important part of this problem is the data structures you use to store your input data. You want something that will lead you naturally to the records you want.
| [reply] |
|
|
It looks like my way will be convoluted (maybe due to not having a very in-depth knowledge of Perl). I'll post my code once it's written and maybe someone can improve it. :-)
| [reply] |
|
|
Well I think I have something that works - don't really know how efficient it is so any advice would be most welcome:
Datatools::Parsecsv is a simple module wrapping the parse_csv subroutine we use:
sub parse_csv {
my $text = shift;
my $delim = shift;
my $type = undef;
if ($delim && $delim =~ /comma|tab|pipe|fixed/i){
($type, $delim) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms;
}
my @new = ();
if ($delim){
while ($text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)"$delim?
| ([^$delim]+)$delim?
| $delim
}gx){
if (defined $+){
push(@new, $+)
}
else{
push(@new, '')
}
}
push(@new, '') if substr($text, -1,1) eq ($delim);
}
else{
$new[0] = $text;
}
return @new; # list of values that were comma-separated
}
Anyway below is the code which I've tested on a few files and it appears to function. Please let me know if any of the code could be made more efficient - I have commented the code as much as possible.
#!c:/usr/bin/perl
use strict;
use warnings;
use Datatools::Parsecsv;
my @field_selection = qw(23 24 25 26); #ARRAY INDEX LOCATIONS
my @file_list = ('File1',
'File2',
'File3'
);
my %field_key = ();
my %file_key = ();
my $record_key_count = undef;
#set location of the sample record output file
my $proof_file = 'K:\Perl Development\Test Files\J02394_proofs.csv';
foreach my $file (@file_list){
my $header = 1;
open (INFILE, "$file")
|| die "Cannot open file $file for reading: $!\n";
while (my $record = <INFILE>){
chomp $record;
#ignore record header
if ($header == 1){
$header = 0;
next;
}
#split fields into @fields
my @fields = parse_csv($record, ',');
my $record_key = undef;
#build up concatenated key with pipe delimiter
#create field_key{field number} -> {field_key} = occurrences
foreach my $field_number (@field_selection){
$record_key .= '|'.$fields[$field_number];
${$field_key{$field_number}}{$fields[$field_number]}++;
}
#create
#file_key{filename}->{concatenated record key} = full record
${$file_key{$file}}{$record_key} = $record;
}
close INFILE;
}
open (PROOF_FILE, ">$proof_file")
|| die " cannot write to proof file $proof_file: $!\n";
my $built_key = undef;
OPTIMISE:
#Generic hash for value sorting hashes
my %hash_value_to_sort = ();
#keep track of smallest field tests
my %smallest_field = ();
#keep track of tried fields for multiple passes
my %tried_field = ();
#keep track of smallest fields that revert to
#global_match
my %exhausted = ();
my $match_result = 0;
#loop, building keys until match result == 1
while ($match_result == 0){
$built_key = &build_key();
$match_result = &check_key_match($built_key);
}
goto OPTIMISE;
sub build_key{
my $still_keys_left = undef;
my $appended_keys = undef;
#cycle through field selection
foreach my $field_number (@field_selection){
#when field keys exhausted
#keep global until successful match
#resets %exhausted
if (exists $exhausted{$field_number}){
$appended_keys .= '\|' . 'global_match';
next;
}
#get a key for each field and build up final key
my $found_key = &get_field_key($field_number);
print "$field_number:" . ${$found_key}{'key'} . "\n";
$appended_keys .= '\|' . ${$found_key}{'key'};
#if field key returns global match then all
#field keys have been exhausted. No need to
#check for smallest for this field anymore
#so clear %smallest_field if it relates to this
#field
if (${$found_key}{'key'} =~ m/ global_match /xms){
$exhausted{$field_number} = '1';
if (exists $smallest_field{$field_number}){
%smallest_field = ();
}
}
#otherwise this field still has keys left
else{
$still_keys_left = 1;
}
#keep track of tried keys for this field in case
#we have multiple passes
${$tried_field{$field_number}}{${$found_key}{'key'}} = 1;
#don't bother with defining smallest once fields
#exhausted; go to next field
if (exists $exhausted{$field_number}){
next ;
}
#1st definition
#flag the field number and the number of
#occurrences of the key found. Flag the field
#as smallest.
if (not defined $smallest_field{'number'}){
$smallest_field{'number'} = ${$found_key}{'number'};
$smallest_field{$field_number} = '1';
}
#otherwise check current number of occurrences for
#this key and replace smallest if lower. Flag the
#field as smallest.
elsif (${$found_key}{'number'} < $smallest_field{'number'}){
$smallest_field{'number'} = ${$found_key}{'number'};
$smallest_field{$field_number} = '1';
}
}
#if no keys left to find, close the proof file and exit
#the program
if (not defined $still_keys_left){
close PROOF_FILE;
exit;
}
#otherwise return the appended key
return $appended_keys;
}
sub get_field_key{
#field we want to get a key for
my $field = shift;
#generic hash for value sorting
%hash_value_to_sort = %{$field_key{$field}};
#cycle keys lowest number of occurrences first
#this helps to optimise the record selection
foreach my $key ( sort HashValueSort keys %hash_value_to_sort ) {
#check if the field is flagged smallest occurrence
#only select next if key not already tried
if (exists $smallest_field{$field}){
if (exists ${$tried_field{$field}}{$key}){
next;
}
}
#return key and number of occurrences
return {'key' => $key,
'number' => $hash_value_to_sort{$key}
};
}
#if no valid key available (i.e. all tried or all keys found)
#return a global match for this field
return {'key' => 'global_match',
'number' => '0'
};
}
sub check_key_match{
my $check_key = shift;
#substitute with global pattern match
$check_key =~ s/\|global_match/\|\.\+/g;
#loop through each file until a record key is
#found in the file hash
#print matching record from %file_key hash
#delete matching keys from %field_key hash
#delete matching record from $file_key hash
foreach my $file (@file_list){
foreach my $key (keys %{$file_key{$file}}){
if ($key =~ m/\A $check_key \z/xms){
my $not_dupe = &delete_keys_found($key);
if ($not_dupe == 1){
print PROOF_FILE ${$file_key{$file}}{$key} . "\n";
print STDOUT ${$file_key{$file}}{$key} . "\n";
delete ${$file_key{$file}}{$key};
}
#flag match found
return 1;
}
}
}
#flag no match found
return 0;
}
sub delete_keys_found{
my $delete_keys = shift;
my @delete_keys = split ('\|', $delete_keys);
my $found_key = 0;
#make up any blank last field after split on '|'
while ($#delete_keys < $#field_selection){
push @delete_keys,'';
}
#ignore empty first index after split on '|'
my $fld_num = 1;
foreach my $field (@field_selection){
if (exists ${$field_key{$field}}{$delete_keys[$fld_num]}){
delete ${$field_key{$field}}{$delete_keys[$fld_num]};
$found_key = 1;
}
$fld_num++;
}
#if no keys found to delete then all were dupes
#so flag and do not print record
return $found_key;
}
sub HashValueSort{
$hash_value_to_sort{$b} <=> $hash_value_to_sort{$a};
}
| [reply] [d/l] [select] |
|
|
Yup, need to show live records. This is a bit of a brain teaser for me!
| [reply] |
|
|
while ( defined( my $record = <$input_fh> ) ) {
my %record_hash = rec2hash( $record );
while ( my ( $field, $value ) = each %record_hash ) {
my $variant = variant_of( $value );
$variants_in{ $record }{ $field }{ $variant } = 1;
push @{ $records_for{ $field }{ $variant } }, $record;
}
}
Here I store, for each field-variant pair, every record that has that pair. Also, for each record, I have every variant that's present in it. I leave it to you to take a record and turn it into field-value pairs, and I also leave it to you to turn a particular value into a "variant" (each value may be a variant, but I can't tell from your description).
From here you figure out which field-variant pair you want to represent most by looping through the pairs in %records_for and finding which pair has the fewest records available. From that you get a list of records which exemplify that field-variant pair.
Then look in %variants_in for each record and see which one represents the most different field-variant pairs. That will be one of your examples.
Then you can delete that record from %variants_in, and you'll want to take every field-variant pair in that record and delete them from %records_for since you no longer need examples of them.
Then go back and pick another record of interest until there aren't any more.
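A rough, untested sketch of that selection loop, picking up where the snippet above leaves off (it assumes %records_for and %variants_in are my variables at file scope):
my @sample;
while ( keys %records_for ) {

    # Find the field/variant pair with the fewest example records left.
    my ( $rare_field, $rare_variant, $fewest );
    for my $field ( keys %records_for ) {
        for my $variant ( keys %{ $records_for{$field} } ) {
            my $count = @{ $records_for{$field}{$variant} };
            if ( !defined $fewest or $count < $fewest ) {
                ( $rare_field, $rare_variant, $fewest ) = ( $field, $variant, $count );
            }
        }
    }

    # Of the records showing that pair, keep the one covering the most
    # other still-needed pairs.
    my ($best) = sort { covers($b) <=> covers($a) }
                 @{ $records_for{$rare_field}{$rare_variant} };
    push @sample, $best;

    # Cross off every pair that record exemplifies, then forget the record.
    for my $field ( keys %{ $variants_in{$best} } ) {
        delete $records_for{$field}{$_} for keys %{ $variants_in{$best}{$field} };
        delete $records_for{$field} unless %{ $records_for{$field} };
    }
    delete $variants_in{$best};
}

# How many still-needed field/variant pairs a record would cover.
sub covers {
    my $record  = shift;
    my $covered = 0;
    for my $field ( keys %{ $variants_in{$record} } ) {
        next unless exists $records_for{$field};
        $covered += grep { exists $records_for{$field}{$_} }
                    keys %{ $variants_in{$record}{$field} };
    }
    return $covered;
}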
Instead of a multi-level hash, it may make it easier to pick some separator and store field-variant pairs that way. That is, you do $records_for{ "$field$separator$variant" } instead of $records_for{ $field }{ $variant }. The problem with this is you have to make sure the separator never appears in your fields or variants.
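For what it's worth, Perl's multidimensional hash emulation already does this: a comma-separated key list inside the braces is joined with $; (a control character, "\034", by default), so you get flat keys without inventing a separator, with the same caveat that the fields must not contain $; themselves. Using the same variables as above ($flat_key standing for whatever key you later pull back out of the hash):
# $records_for{ $field, $variant } is shorthand for
# $records_for{ join( $;, $field, $variant ) }
push @{ $records_for{ $field, $variant } }, $record;

# ...and a flat key pulled back out of the hash can be split apart again:
my ( $f, $v ) = split /$;/, $flat_key;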
Hope this helps.
| [reply] [d/l] [select] |
Re: Most efficient record selection method?
by CountZero (Bishop) on Feb 12, 2009 at 20:26 UTC
|
Really, I do not think it has to be more difficult than this:
use strict;
my (%first, %second, %third);
while (<DATA>) {
my ($first, $second, $third, $fourth) = split ',';
# print the line unless all three key values have been seen before
print unless exists $first{$first} and exists $second{$second} and exists $third{$third};
$first{$first}++;
$second{$second}++;
$third{$third}++;
}
__DATA__
A1, B1, C1, first record
A2, B2, C2, second record
A5, B2, C3, third record
A4, B1, C3, fourth record
A1, B3, C5, fifth record
A2, B3, C5, sixth record
A3, B1, C2, seventh record
A4, B2, C4, eighth record
A1, B3, C4, ninth record
A2, B1, C1, tenth record
Resulting in:
A1, B1, C1, first record
A2, B2, C2, second record
A5, B2, C3, third record
A4, B1, C3, fourth record
A1, B3, C5, fifth record
A3, B1, C2, seventh record
A4, B2, C4, eighth record
And if you dump the hashes you find how many of each of the variants occurred. In a real-world application you would parse the CSV data with the usual Text::CSV, of course.
Update: On reading your question again, I see that I did not take into account the fact that you wanted to have all variants shown in the least number of records. However, I do not think this is something you can exactly solve in any reasonable time; it smells of "Travelling Salesman" to me. But as a gut feeling, I doubt that when working with real-world data there is much to be gained by this optimization, unless you have really strange data.
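For example, the split above might become something like this with Text::CSV ($file being a placeholder for one of the input files):
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
open my $fh, '<', $file or die "Cannot open $file: $!";
while ( my $row = $csv->getline($fh) ) {
    my ( $first, $second, $third ) = @{$row}[ 0 .. 2 ];
    # ...same bookkeeping as above...
}
close $fh;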
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] [d/l] [select] |
|
|
Consider input like this, where a greedy pass keeps all four records even though [A1,B1,C1] and [A2,B2,C2] alone would cover every value:
A1, B1, C1, first record
A1, B2, C1, second record
A1, B1, C2, third record
A2, B2, C2, fourth record
I think you may be right about the travelling salesman problem, though. I thought I had a pretty good way to go in my earlier notes, but now I'm not so sure. Maybe Kraythorne will come back with an implementation and tell us.
| [reply] [d/l] |
|
|
Yes, that is right. These algorithms are all extremely sensitive to the order of the data presented to them. You will grant me that your example is a purpose-built "bad" case, hence my comment that with real data the risk of such edge cases is (much) smaller. Any optimization that needs to be done will have to look at all the combinations, which quickly increase in number. Monks more versed in mathematics will correct me if I'm wrong, but I guess it will be a factorial function. Borrowing the technique of "random improvement" and the "inversion" operator (as discussed in Travelling Salesman), I tried this in the following program:
use strict;
my $debug = 0;
my @data = <DATA>;
my $count = @data;
print "Original: $count\n";
# first run which cannot be worse than the whole file
my @result = try_me(@data);
my $result = @result;
print "Run 1: $result\n";
print @result;
#we will unloop $unloop_factor records each time and see if we get a better result
my $unloop_factor = 3;
#we will try this $number_of_runs times
my $number_of_runs = 5000;
foreach my $number ( 2 .. $number_of_runs + 1 ) {
my @new_data = unloop( $unloop_factor, @data );
print "New data:\n @new_data" if $debug;
my @new_result = try_me(@new_data);
my $new_result = @new_result;
print "Run $number: $new_result\n" if $debug;
if ( $new_result <= $result ) {
print "New result:\n @new_result" if $debug;
@data = @new_data;
@result = @new_result;
$result = $new_result;
} ## end if ( $new_result <= $result)
} ## end foreach my $number ( 2 .. $number_of_runs...
print "\nFinal result is: $result\n @result\n";
sub unloop {
my ( $unloop_factor, @data ) = @_;
my $length = @data;
my $start = int( rand( $length - $unloop_factor ) );
print "\nUnloop after $start\n" if $debug;
my @begin = @data[ 0 .. $start ];
my @middle = @data[ $start + 1 .. $start + $unloop_factor ];
my @end = ( @data, @data )[ $start + $unloop_factor + 1 .. $length - 1 ];
return ( @begin, reverse(@middle), @end );
} ## end sub unloop
sub try_me {
my @input = @_;
my @result;
my ( %first, %second, %third );
foreach (@input) {
my ( $first, $second, $third, $fourth ) = split ',';
push @result, $_
unless exists $first{$first}
and exists $second{$second}
and exists $third{$third};
$first{$first}++;
$second{$second}++;
$third{$third}++;
} ## end foreach (@input)
return @result;
} ## end sub try_me
__DATA__
A1, B1, C1, first record*
A2, B1, C3, second record
A3, B1, C2, third record
A1, B2, C3, fourth record
A2, B2, C2, fifth record*
A3, B2, C1, sixth record
A1, B3, C2, seventh record
A2, B3, C1, eighth record
A3, B3, C3, ninth record*
A1, B2, C3, tenth record
It improves the result, but rarely finds the optimal solution. Perhaps the size of the data is too small and/or the "unloop factor" is not well chosen. The unloop sub is not optimal either, as it will not swap the first record. Comments and improvements are invited!
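One possible tweak (untested) that lets the reversed window start at the very first record is to let $start go down to -1 and treat @begin as empty in that case:
sub unloop {
    my ( $unloop_factor, @data ) = @_;
    my $length = @data;

    # -1 .. $length - $unloop_factor - 1, so @middle can start at index 0
    my $start = int( rand( $length - $unloop_factor + 1 ) ) - 1;

    my @begin  = $start >= 0 ? @data[ 0 .. $start ] : ();
    my @middle = @data[ $start + 1 .. $start + $unloop_factor ];
    my @end    = @data[ $start + $unloop_factor + 1 .. $length - 1 ];

    return ( @begin, reverse(@middle), @end );
}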
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] [d/l] [select] |
|
|
Re: Most efficient record selection method?
by jethro (Monsignor) on Feb 12, 2009 at 17:03 UTC
|
It may be impossible to get an optimal result, but I think this heuristic might be quite successful:
Have three hashes of arrays (HoA), keyed by the values of the three fields, where each array holds the CSV lines that have that key. Only the lines that are part of the minimal set are collected in these arrays.
Loop through the files with a module like Text::CSV or Parse::CSV.
For every line, check whether any of the three field values is already in the corresponding hash. If any of the three values is missing, add the line to all three hashes (i.e. add the line to the array of hash1{$field_value1}, and so on).
If all three field values are already in the database, check whether adding this line would allow you to drop two or three other lines out of the hashes. Let's call the three values of your line a, b and c. Now hash1 for a should point to (a,x,y). Check whether x in hash2 has two lines in its array (can't be more than 2) and y in hash3 has two lines. If that is the case, you could remove (a,x,y). Do the same with b and c. If you collected more than one line to remove, then do it. There is the complication that you could find the same lines more than once in the three searches, so you have to be careful about the edge cases. A rough sketch of the first part is below.
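Untested sketch of the bookkeeping for the first part of this; the field positions and the __DATA__ lines are placeholders, and the swap step is only indicated by a comment:
use strict;
use warnings;

my @key_fields = ( 0, 1, 2 );   # placeholder positions of the three fields
my @cover;                      # one hash-of-arrays per key field: value => [kept lines]
my @kept;                       # lines selected so far

while ( my $line = <DATA> ) {
    chomp $line;
    my @values = ( split /,\s*/, $line )[@key_fields];   # stand-in for Text::CSV

    # Keep the line if any of its three values is not yet represented.
    if ( grep { !exists $cover[$_]{ $values[$_] } } 0 .. $#key_fields ) {
        push @kept, $line;
        push @{ $cover[$_]{ $values[$_] } }, $line for 0 .. $#key_fields;
        next;
    }

    # Otherwise: test whether keeping this line instead would let two or
    # three already-kept lines drop out (each of their values must appear
    # in at least one other kept line); that is the swap check described above.
}

print "$_\n" for @kept;
__DATA__
A1, B1, C1, first
A2, B2, C2, second
A1, B2, C1, third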
| [reply] |
Re: Most efficient record selection method?
by rir (Vicar) on Feb 18, 2009 at 20:55 UTC
|
Update: Major update. I am going to reexamine putting my $HOME under revision control.
Here is the algorithm.
- For each value in a field, keep track of the number of occurrences in the input. Set the frequency to zero for a value when a record is moved to the output. So:
$freq->{$fld_num}->{ $record[$fld_num] } = 0;
is a central fragment. A score of zero means the data field value is no longer needed; a score of one means the value is needed in the results; and a score greater than one means the value is needed but which record will supply it is undetermined.
- Repeat on the input:
Possibly change the permutation machine so that each "permutation" array is treated as a wheel, reducing the number of arrays needing to be created. I think the walking is the time sink, not creating the walkways.
Here's unguaranteed code.
Be well,
rir
| [reply] [d/l] [select] |
|
|
Oops, posted my code higher up (i.e. completed code that appears to work).
Just to confirm: it will accept any number of files and any selection of fields, and calculate the minimum number of records required from across all the files to demonstrate the full range of field values for the fields selected (I think!).
If anyone can pick up on any bugs, please let me know :-)
| [reply] |
Re: Most efficient record selection method?
by ELISHEVA (Prior) on Feb 13, 2009 at 09:01 UTC
|
For example, we have 5 comma-separated data files with 20 fields and we have been asked to show all variants contained in each of 3 of the fields over the 5 files in the least number of records.
Variants of what? Field values? Field value combinations? Variants relative to a canonical set of values? This isn't very clear to me. Do you mean:
- The smallest sample of records such that each and every value from field X appears in at least one of the sample records? i.e. combinations of values don't matter. We only require that the full domain of X, the full domain of Y, and the full domain of Z be represented.
- The smallest sample of records such that each and every canonical value from field X appears in at least one of the sample records? (this would be a smaller set than the previous sample)
- The smallest sample of records such that all distinct combinations of values from fields X,Y,Z are included?
Also, do you really want the smallest possible sampling of records, or just to get rid of records for which all values have already been seen? Consider the sample [A1,B1],[A1,B2],[A2,B1],[A2,B2]. If you only want to get rid of records where all values have been seen then, assuming you don't care about combinations, [A1,B1],[A1,B2],[A2,B1] (a la CountZero's solution) is OK. If you really want the minimum number of records, then the right solution would be [A1,B1],[A2,B2], as pointed out by Kyle, and we've got a much more difficult optimization problem.
Best, beth
Update: removed the code sample due to its failure to solve the problem raised by Kyle, and added a question about what the OP really means by "least".
| [reply] [d/l] [select] |
|
|
Ok.
Variants = the mixture of values contained in each field.
Sample records = live records containing these values.
I don't care about the combination, but we do need to have the minimum number of records so:
[A1,B1],[A1,B2],[A2,B1],[A2,B2] (variants over 2 fields)
could be resolved in 3 records as you have said:
record 1 - contains [A1,B1]
record 2 - contains [A1,B2]
record 3 - contains [A2,B1]
HOWEVER, if there are 2 records that contain:
record 1 - [A1,B1]
record 2 - [A2,B2]
Then these need to be identified and used in preference to the first combination as that will be the minimum number of records I need to represent all the variants in each of the fields.
This gets more complicated as the number of fields and files being interrogated increases.
| [reply] [d/l] [select] |
|
|
Ok. Here is a start on an algorithm. The sample code contains two fake files (represented by arrays of lines) to demonstrate how to maintain state as we pass from one file to the next. It also contains a very simple optimizer that can eliminate lines that add only one new value in favor of lines that add two or more new values at once.
However, that optimization is not going to find the actual smallest sample if the number of selected fields is greater than 2. It is also possible that we might find a record with two new values and then later on a record with the same two values plus a new value in a third field. You will need to replace the simple optimizer with a more general purpose optimizer. Alas, coming up with a general solution is more work than I have time for.
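A greedy pass in the spirit described above, always taking the line that adds the most not-yet-seen values, might look something like this (untested; the field positions and the two "file" arrays are placeholders):
use strict;
use warnings;

my @key_fields = ( 0, 1, 2 );                    # placeholder field positions
my @file_one   = ( "A1,B1,C1", "A1,B2,C1" );     # stand-ins for two input files
my @file_two   = ( "A2,B2,C2", "A1,B1,C2" );
my @lines      = ( @file_one, @file_two );       # state carries across "files"

# Collect the values still needing an example, per field.
my @todo;                                        # $todo[$i]{$value} = 1 while unseen
for my $line (@lines) {
    my @values = ( split /,/, $line )[@key_fields];
    $todo[$_]{ $values[$_] } = 1 for 0 .. $#key_fields;
}

my @sample;
while ( grep { %$_ } @todo ) {
    # Score every line by how many still-unseen values it would add.
    my ( $best_line, $best_score ) = ( undef, 0 );
    for my $line (@lines) {
        my @values = ( split /,/, $line )[@key_fields];
        my $score  = grep { exists $todo[$_]{ $values[$_] } } 0 .. $#key_fields;
        ( $best_line, $best_score ) = ( $line, $score ) if $score > $best_score;
    }
    last unless $best_line;

    push @sample, $best_line;
    my @values = ( split /,/, $best_line )[@key_fields];
    delete $todo[$_]{ $values[$_] } for 0 .. $#key_fields;
}

print "$_\n" for @sample;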
Best, beth
Update: Fixed small bug
| [reply] [d/l] |