Re: Most efficient record selection method?

Really, I do not think it has to be more difficult than this:

use strict;

my (%first, %second, %third);

while (<DATA>) {
    my ($first, $second, $third, $fourth) = split ',';
    print unless exists $first{$first} and exists $second{$second} and
+ exists $third{$third};
    $first{$first}++;
    $second{$second}++;
    $third{$third}++;
}

__DATA__
A1, B1, C1, first record
A2, B2, C2, second record
A5, B2, C3, third record
A4, B1, C3, fourth record
A1, B3, C5, fifth record
A2, B3, C5, sixth record
A3, B1, C2, seventh record
A4, B2, C4, eight record
A1, B3, C4, nineth record
A2, B1, C1, tenth record
[download]

Resulting in:

A1, B1, C1, first record
A2, B2, C2, second record
A5, B2, C3, third record
A4, B1, C3, fourth record
A1, B3, C5, fifth record
A3, B1, C2, seventh record
A4, B2, C4, eight record
[download]

And if you dump the hashes you find how many of each of the variants occurred.

In a real world application you would parse the CSV data with the usual Text::CSV of course.

Update: On reading your question again, I see that I did not take into account the fact that you wanted to have all variants shown in the least number of records. However, I do not think this is something you can exactly solve in any reasonable time, it smells "Travelling salesman" to me. But as a gut feeling , I doubt that when working with real world data, there is much to be gained by this optimization, unless you have real strange data.

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Comment on Re: Most efficient record selection method? Select or Download Code

Replies are listed 'Best First'.
Re^2: Most efficient record selection method? by kyle (Abbot) on Feb 12, 2009 at 20:39 UTC
If this is your input, you output all four records instead of just the first and last. `A1, B1, C1, first record A1, B2, C1, second record A1, B1, C2, third record A2, B2, C2, fourth record` [download] I think you may be right about the travelling salesman problem, though. I thought I had a pretty good way to go in my earlier notes, but now I'm not so sure. Maybe Kraythorne will come back with an implementation and tell us.	[reply] [d/l]
Re^3: Most efficient record selection method? by CountZero (Bishop) on Feb 12, 2009 at 22:59 UTC
Yes that is right. These algorithms are all extremely sensitive to the order of the data presented to them. You will grant me that your example is a purpose-built "bad" case, hence my comment that with real data, the risk of such edge cases is (much) smaller. Any optimization that needs to be done will have to look at all the combinations, which quickly increase in number. Monks more versed in mathematics will correct me if I'm wrong, but I guess it will be a factorial function. Borrowing on the technique of "Random improvement" and the "inversion" operator (as discussed in Travelling Salesman) I tried this in the following program: use strict; my $debug = 0; my @data = <DATA>; my $count = @data; print "Original: $count\n"; # first run which cannot be worse than the whole file my @result = try_me(@data); my $result = @result; print "Run 1: $result\n"; print @result; #we will unloop $unloop_factor records each time and see if we get a b +etter result my $unloop_factor = 3; #we will try this $number_of_runs times my $number_of_runs = 5000; foreach my $number ( 2 .. $number_of_runs + 1 ) { my @new_data = unloop( $unloop_factor, @data ); print "New data:\n @new_data" if $debug; my @new_result = try_me(@new_data); my $new_result = @new_result; print "Run $number: $new_result\n" if $debug; if ( $new_result <= $result ) { print "New result:\n @new_result" if $debug; @data = @new_data; @result = @new_result; $result = $new_result; } ## end if ( $new_result <= $result) } ## end foreach my $number ( 2 .. $number_of_runs... print "\nFinal result is: $result\n @result\n"; sub unloop { my ( $unloop_factor, @data ) = @_; my $length = @data; my $start = int( rand( $length - $unloop_factor ) ); print "\nUnloop after $start\n" if $debug; my @begin = @data[ 0 .. $start ]; my @middle = @data[ $start + 1 .. $start + $unloop_factor ]; my @end = ( @data, @data )[ $start + $unloop_factor + 1 .. $length + - 1 ]; return ( @begin, reverse(@middle), @end ); } ## end sub unloop sub try_me { my @input = @_; my @result; my ( %first, %second, %third ); foreach (@input) { my ( $first, $second, $third, $fourth ) = split ','; push @result, $_ unless exists $first{$first} and exists $second{$second} and exists $third{$third}; $first{$first}++; $second{$second}++; $third{$third}++; } ## end foreach (@input) return @result; } ## end sub try_me __DATA__ A1, B1, C1, first record* A2, B1, C3, second record A3, B1, C2, third record A1, B2, C3, fourth record A2, B2, C2, fifth record* A3, B2, C1, sixth record A1, B3, C2, seventh record A2, B3, C1, eight record A3, B3, C3, nineth record* A1, B2, C3, tenth record [download] It improves the result, but rarely finds the optimal solution. Perhaps the size of the data is too small and/or the "unloop factor" is not well chosen. The `unloop` sub is not optimal either, as it will not swap the first record. Comments and improvements are invited! CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re^4: Most efficient record selection method? by ELISHEVA (Prior) on Feb 13, 2009 at 12:16 UTC
The issue isn't just the order - it is also having more than one duplicate value in the same record. Your algorithm prints a record each time a field has a value that has never been seen before. That means that if, by chance, one encounters a record with a new value in A (all the other fields are duplicates) and then later on a record with a new value in B (all the rest being duplicates), you will have to add two records to your sample, even if both those values show up in the same record later on. This is hardly a corner case. The probability of finding two records each with only one field with a not-yet-seen value depends on the number of possible values for each field and the number of those values found so far, not the total number of records in the dataset. The probability of there being a record with a particular combination of duplicate values depends on the size of the data set. In any sufficiently large dataset, it is highly likely that there will be a later record that contains both of those values at once. Because it came after rather than before you have 2 records where you could have had one. The more records in the data set, the greater the probability of that happening. Best, beth	[reply]