in reply to Creating a sub-list constrained by an average while maintaining maximum number of original elements

It can also be advantageous, in some cases (say, when dealing with “hundreds of thousands of files”), to take one of two approaches: sort the data first, or make a single pass through a randomized ordering of it.

Sorting is an unexpectedly fast and efficient approach to many problems, this one included: once the data is ordered, the elements that push the average up or down sit at the two ends of the list, so you can trim from whichever end is hurting you. (“Sorting and Searching” is, of course, the title of one of Dr. Knuth’s definitive textbooks.)
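As a rough sketch of that idea (the function name, `target`, and `tolerance` parameters are my own illustrative choices, not anything prescribed by the question): sort once, start with every element, and repeatedly drop the extreme value on whichever side is dragging the average away from the target. Because only one element is discarded per step, the surviving sub-list stays as large as possible.

```python
def trim_to_target_average(values, target, tolerance=0.01):
    """Greedy, sort-based heuristic: keep trimming the offending extreme
    until the average of the survivors is within `tolerance` of `target`."""
    pool = sorted(values)
    lo, hi = 0, len(pool)          # pool[lo:hi] is the surviving sub-list
    total = sum(pool)
    while lo < hi and abs(total / (hi - lo) - target) > tolerance:
        if total / (hi - lo) > target:
            hi -= 1                # average too high -> drop the largest survivor
            total -= pool[hi]
        else:
            total -= pool[lo]      # average too low -> drop the smallest survivor
            lo += 1
    return pool[lo:hi]
```

The sort costs O(n log n) and each trimming step is O(1), which is perfectly manageable even for hundreds of thousands of values.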

You also need to ponder the trade-off between “goodness of fit” and “diversity”: which of the two matters more to you? It could well be argued that the algorithm should consist of a single pass through the randomized file, adding each new record to the pool and, on every iteration, throwing out a randomly selected element chosen by the simple criterion of whether it is “larger than” or “smaller than” the target value. A sketch of that idea follows.
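Here is one possible reading of that scheme, written for numeric records (the function name, the `pool_size` parameter, and the eviction rule details are assumptions on my part): shuffle the input, fill a fixed-size pool, then for every remaining record add it and evict a random element from whichever side of the target is currently pulling the pool’s average off course. The random eviction is what preserves diversity; the side test is what nudges the fit.

```python
import random

def single_pass_pool(records, target, pool_size=1000, seed=None):
    """Single randomized pass: add each record, then evict one element at
    random from the side of `target` that the running average has drifted toward."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    pool = shuffled[:pool_size]
    total = sum(pool)
    for record in shuffled[pool_size:]:
        pool.append(record)
        total += record
        too_high = total / len(pool) > target
        # candidates live on the side currently dragging the average away
        candidates = [i for i, v in enumerate(pool) if (v > target) == too_high]
        victim = rng.choice(candidates) if candidates else rng.randrange(len(pool))
        total -= pool[victim]
        pool[victim] = pool[-1]    # swap-remove keeps the eviction O(1)
        pool.pop()
    return pool
```

Note that this variant keeps the pool at a fixed size rather than maximizing it, which is exactly the fit-versus-size tension the trade-off question is asking you to settle.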

It might be prudent to prototype several candidate algorithms and then choose among them based both on your understanding of the problem and on their observed statistical characteristics, for instance with a small simulation harness like the one sketched below.
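A minimal harness for that comparison might look like this (entirely illustrative, and assuming candidate functions with the same `(data, target)` signature as the two sketches above): run each candidate over several randomly generated data sets and report how many elements it keeps and how far its average lands from the target.

```python
import random
import statistics

def compare_candidates(candidates, target, n_trials=20, n_records=5000, seed=0):
    """Run each candidate selection function over simulated data and print
    the average number of elements kept and the average fit error."""
    rng = random.Random(seed)
    for name, fn in candidates.items():
        kept, errors = [], []
        for _ in range(n_trials):
            data = [rng.gauss(target, 25) for _ in range(n_records)]
            chosen = fn(data, target)
            kept.append(len(chosen))
            errors.append(abs(statistics.mean(chosen) - target))
        print(f"{name:>15}: kept {statistics.mean(kept):8.1f} elements, "
              f"mean |average - target| = {statistics.mean(errors):.4f}")

# e.g.:
# compare_candidates(
#     {"sort-and-trim": trim_to_target_average,
#      "single-pass": single_pass_pool},
#     target=100.0)
```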