in reply to Re^2: Most efficient record selection method?
in thread Most efficient record selection method?

Yes, that is right. These algorithms are all extremely sensitive to the order in which the data are presented to them. You will grant me, though, that your example is a purpose-built "bad" case; hence my comment that with real data the risk of such edge cases is (much) smaller.

Any exhaustive optimization would have to look at all possible orderings of the records, and their number increases very quickly. Monks more versed in mathematics will correct me if I'm wrong, but I believe it is a factorial function: n! orderings for n records.
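If that guess is right, a quick back-of-the-envelope check (plain Perl, independent of the program below) shows why brute force is hopeless even for the ten sample records used further down:

use strict;
use warnings;

# n! orderings for n records: the number an exhaustive search must examine
my $orderings = 1;
$orderings *= $_ for 1 .. 10;
print "$orderings\n";    # 3628800 orderings for just 10 records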

Borrowing the technique of "random improvement" and the "inversion" operator (as discussed in the Travelling Salesman problem), I tried this in the following program:

use strict;

my $debug = 0;

my @data  = <DATA>;
my $count = @data;
print "Original: $count\n";

# first run which cannot be worse than the whole file
my @result = try_me(@data);
my $result = @result;
print "Run 1: $result\n";
print @result;

# we will unloop $unloop_factor records each time and see if we get a better result
my $unloop_factor = 3;

# we will try this $number_of_runs times
my $number_of_runs = 5000;

foreach my $number ( 2 .. $number_of_runs + 1 ) {
    my @new_data = unloop( $unloop_factor, @data );
    print "New data:\n @new_data" if $debug;
    my @new_result = try_me(@new_data);
    my $new_result = @new_result;
    print "Run $number: $new_result\n" if $debug;
    if ( $new_result <= $result ) {
        print "New result:\n @new_result" if $debug;
        @data   = @new_data;
        @result = @new_result;
        $result = $new_result;
    } ## end if ( $new_result <= $result)
} ## end foreach my $number ( 2 .. $number_of_runs...

print "\nFinal result is: $result\n @result\n";

sub unloop {
    my ( $unloop_factor, @data ) = @_;
    my $length = @data;
    my $start  = int( rand( $length - $unloop_factor ) );
    print "\nUnloop after $start\n" if $debug;
    my @begin  = @data[ 0 .. $start ];
    my @middle = @data[ $start + 1 .. $start + $unloop_factor ];
    my @end    = ( @data, @data )[ $start + $unloop_factor + 1 .. $length - 1 ];
    return ( @begin, reverse(@middle), @end );
} ## end sub unloop

sub try_me {
    my @input = @_;
    my @result;
    my ( %first, %second, %third );
    foreach (@input) {
        my ( $first, $second, $third, $fourth ) = split ',';
        push @result, $_
          unless exists $first{$first}
          and exists $second{$second}
          and exists $third{$third};
        $first{$first}++;
        $second{$second}++;
        $third{$third}++;
    } ## end foreach (@input)
    return @result;
} ## end sub try_me

__DATA__
A1, B1, C1, first record*
A2, B1, C3, second record
A3, B1, C2, third record
A1, B2, C3, fourth record
A2, B2, C2, fifth record*
A3, B2, C1, sixth record
A1, B3, C2, seventh record
A2, B3, C1, eighth record
A3, B3, C3, ninth record*
A1, B2, C3, tenth record
It improves the result, but rarely finds the optimal solution. Perhaps the dataset is too small and/or the "unloop factor" is not well chosen. The unloop sub is not optimal either, as it will never reverse a window that includes the first record; a possible fix is sketched below. Comments and improvements are invited!
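One way to remedy that (a sketch only, not tested inside the full program): make $start the index of the window's first element instead of the element before it, so that 0 becomes a legal starting point:

sub unloop {
    my ( $unloop_factor, @data ) = @_;
    my $length = @data;

    # $start is now the index of the first element to be reversed,
    # ranging over 0 .. $length - $unloop_factor
    my $start = int( rand( $length - $unloop_factor + 1 ) );

    my @begin  = @data[ 0 .. $start - 1 ];    # empty list when $start == 0
    my @middle = @data[ $start .. $start + $unloop_factor - 1 ];
    my @end    = @data[ $start + $unloop_factor .. $length - 1 ];
    return ( @begin, reverse(@middle), @end );
} ## end sub unloop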

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^4: Most efficient record selection method?
by ELISHEVA (Prior) on Feb 13, 2009 at 12:16 UTC

    The issue isn't just the order - it is also having more than one duplicate value in the same record. Your algorithm prints a record each time any field has a value that has never been seen before. That means that if, by chance, you encounter a record with a new value in A (all the other fields being duplicates) and then later a record with a new value in B (all the rest being duplicates), you have to add two records to your sample, even if both of those values show up in the same record later on.

    This is hardly a corner case. The probability of finding two records, each with only one field holding a not-yet-seen value, depends on the number of possible values for each field and on how many of those values have been seen so far - not on the total number of records in the dataset. The probability of there being a record with a particular combination of duplicate values, on the other hand, does depend on the size of the dataset: in any sufficiently large dataset, it is highly likely that some later record contains both of those values at once. Because that record came after rather than before, you end up with two records in the sample where one would have sufficed. The more records in the dataset, the greater the probability of that happening.
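    A minimal, self-contained illustration (four hypothetical records, with the same keep-unless-all-fields-seen logic as try_me above):

    use strict;
    use warnings;

    # Count how many records the greedy pass keeps, mirroring try_me()
    sub greedy {
        my ( %seen1, %seen2, %seen3, @kept );
        for my $rec (@_) {
            my ( $f1, $f2, $f3 ) = split /,/, $rec;
            push @kept, $rec
              unless exists $seen1{$f1} && exists $seen2{$f2} && exists $seen3{$f3};
            $seen1{$f1}++; $seen2{$f2}++; $seen3{$f3}++;
        }
        return scalar @kept;
    }

    my @order1 = ( 'A1,B1,C1', 'A2,B1,C1', 'A1,B2,C1', 'A2,B2,C1' );
    my @order2 = ( 'A1,B1,C1', 'A2,B2,C1', 'A2,B1,C1', 'A1,B2,C1' );
    print greedy(@order1), "\n";    # 3 - A2 and B2 each cost a separate record
    print greedy(@order2), "\n";    # 2 - A2,B2,C1 covers both new values at once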

    Best, beth