
Yes, that is right. These algorithms are all extremely sensitive to the order in which the data is presented to them. You will grant me that your example is a purpose-built "bad" case, hence my comment that with real data the risk of such edge cases is (much) smaller.

Any optimization that needs to be done will have to look at all the possible orderings of the records, and their number increases very quickly. Monks more versed in mathematics will correct me if I'm wrong, but I believe it grows as a factorial function: n records can be ordered in n! ways.
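To get a feel for how fast that grows, here is a minimal sketch that simply computes n! for a few record counts. The sub name orderings is made up for this illustration, and it assumes every record counts as distinct.

```perl
use strict;
use warnings;

# Number of distinct orderings of $n records: n! = 1 * 2 * ... * n
# (hypothetical helper, assumes all records are distinct)
sub orderings {
    my ($n) = @_;
    my $total = 1;
    $total *= $_ for 2 .. $n;
    return $total;
}

printf "%2d records: %d orderings\n", $_, orderings($_) for 3, 5, 10;
```

For the ten records in the sample data below that is already 3,628,800 orderings, so trying them all is only feasible for very small files.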

Borrowing the technique of "random improvement" and the "inversion" operator (as discussed in Travelling Salesman), I tried this in the following program:

use strict;

my $debug = 0;

my @data  = <DATA>;
my $count = @data;
print "Original: $count\n";

# first run which cannot be worse than the whole file
my @result = try_me(@data);
my $result = @result;
print "Run 1:  $result\n";
print @result;

#we will unloop $unloop_factor records each time and see if we get a better result
my $unloop_factor = 3;

#we will try this $number_of_runs times
my $number_of_runs = 5000;

foreach my $number ( 2 .. $number_of_runs + 1 ) {
    my @new_data = unloop( $unloop_factor, @data );
    print "New data:\n @new_data" if $debug;
    my @new_result = try_me(@new_data);
    my $new_result = @new_result;
    print "Run $number: $new_result\n" if $debug;
    if ( $new_result <= $result ) {
        print "New result:\n @new_result" if $debug;
        @data   = @new_data;
        @result = @new_result;
        $result = $new_result;
    } ## end if ( $new_result <= $result)
} ## end foreach my $number ( 2 .. $number_of_runs...
print "\nFinal result is: $result\n @result\n";

sub unloop {
    my ( $unloop_factor, @data ) = @_;
    my $length = @data;
    my $start  = int( rand( $length - $unloop_factor ) );
    print "\nUnloop after $start\n" if $debug;
    my @begin  = @data[ 0 .. $start ];
    my @middle = @data[ $start + 1 .. $start + $unloop_factor ];
    my @end = @data[ $start + $unloop_factor + 1 .. $length - 1 ];
    return ( @begin, reverse(@middle), @end );
} ## end sub unloop

sub try_me {
    my @input = @_;
    my @result;
    my ( %first, %second, %third );
    foreach (@input) {
        my ( $first, $second, $third, $fourth ) = split ',';
        push @result, $_
            unless exists $first{$first}
                and exists $second{$second}
                and exists $third{$third};
        $first{$first}++;
        $second{$second}++;
        $third{$third}++;
    } ## end foreach (@input)
    return @result;
} ## end sub try_me

__DATA__
A1, B1, C1, first record*
A2, B1, C3, second record
A3, B1, C2, third record
A1, B2, C3, fourth record
A2, B2, C2, fifth record*
A3, B2, C1, sixth record
A1, B3, C2, seventh record
A2, B3, C1, eighth record
A3, B3, C3, ninth record*
A1, B2, C3, tenth record

It improves the result, but rarely finds the optimal solution. Perhaps the size of the data is too small and/or the "unloop factor" is not well chosen. The unloop sub is not optimal either, as it will not swap the first record. Comments and improvements are invited!
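As one possible improvement, here is a rough sketch of an inversion step that can also move the first record. The name invert_segment is hypothetical and not part of the program above; it assumes the same calling convention as unloop (segment width first, then the records).

```perl
use strict;
use warnings;

# Hypothetical replacement for unloop(): reverse a randomly placed run of
# $width records. The run may start at index 0, so the first record can move.
sub invert_segment {
    my ( $width, @data ) = @_;

    # Nothing useful to reverse, or the run would not fit
    return @data if $width < 2 || $width > @data;

    # Random start index between 0 and (number of records - $width)
    my $start = int( rand( @data - $width + 1 ) );
    my $end   = $start + $width - 1;

    @data[ $start .. $end ] = reverse @data[ $start .. $end ];
    return @data;
}

my @records = qw(a b c d e f);
print join( ' ', invert_segment( 3, @records ) ), "\n";
```

In the main loop, the call unloop( $unloop_factor, @data ) could be swapped for invert_segment( $unloop_factor, @data ). This only widens the set of orderings the random search can reach; it does not guarantee that the optimal solution is found.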

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

In reply to Re^3: Most efficient record selection method? by CountZero
in thread Most efficient record selection method? by Kraythorne
