Re^4: Most efficient record selection method?

The issue isn't just the order - it is also having more than one duplicate value in the same record. Your algorithm prints a record each time a field has a value that has never been seen before. That means that if, by chance, you encounter a record with a new value in A (all the other fields being duplicates) and then later on a record with a new value in B (all the rest being duplicates), you will have to add two records to your sample, even if both those values show up together in a single record later on.
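To make that concrete, here is a minimal sketch in Python of what I understand the "print a record whenever any field has a never-seen value" approach to be. The function name first_seen_sample, the field names A and B, and the four sample records are all made up for illustration; this is not your actual code.

```python
def first_seen_sample(records):
    """Keep a record whenever any of its fields holds a value
    that has not yet been seen in that field (hypothetical sketch)."""
    seen = {}    # field name -> set of values already covered
    sample = []
    for rec in records:
        is_new = False
        # Check every field (no early exit) so all values get recorded.
        for field, value in rec.items():
            values = seen.setdefault(field, set())
            if value not in values:
                values.add(value)
                is_new = True
        if is_new:
            sample.append(rec)
    return sample


# Made-up data: the last record carries both a2 and b2, but arrives too late.
records = [
    {"A": "a1", "B": "b1"},  # both values new
    {"A": "a2", "B": "b1"},  # new value only in A
    {"A": "a1", "B": "b2"},  # new value only in B
    {"A": "a2", "B": "b2"},  # nothing new by now, so it is skipped
]

sample = first_seen_sample(records)
print(len(sample))
```

The single pass keeps the first three records, yet the first and last records alone already contain every value of A and B. So on this input the sample is one record larger than it needs to be.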

This is hardly a corner case. The probability of finding two records, each with only one field holding a not-yet-seen value, depends on the number of possible values for each field and on how many of those values have been found so far, not on the total number of records in the dataset. The probability of there being a record with a particular combination of duplicate values, by contrast, depends on the size of the dataset. In any sufficiently large dataset, it is highly likely that there will be a later record that contains both of those values at once. Because it came after rather than before, you have two records where you could have had one. The more records in the dataset, the greater the probability of that happening.
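A rough back-of-the-envelope version of that argument, as a Python sketch. It rests on assumptions that are mine and not something your data necessarily satisfies: field values are uniformly distributed and independent of each other, and records are independent draws. The function name prob_pair_appears and its parameters are hypothetical.

```python
def prob_pair_appears(n_records, values_a, values_b):
    """Probability that at least one of n_records later records holds a
    specific (A, B) pair, ASSUMING uniform, independent field values."""
    # Chance that a single record holds exactly that pair.
    p_single = 1.0 / (values_a * values_b)
    # Complement of "none of the n_records records holds the pair".
    return 1.0 - (1.0 - p_single) ** n_records


# Hypothetical sizes: 10 possible values in each of the two fields.
for n in (10, 100, 1000, 10000):
    print(n, prob_pair_appears(n, 10, 10))
```

Under those assumptions the probability only goes up as the number of remaining records grows, and it approaches 1 for large datasets. Real data will be skewed and correlated, so treat the numbers as an illustration of the trend and not as an estimate.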

Best, beth
