comment on

Hello,

I hope this is the right place to ask this question -- it may be more algorithmic than language-specific.

Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:

158 choose 5 = ~7.7e8
158 choose 6 = ~2.0e10
158 choose 7 = ~4.3e11
158 choose 8 = ~8.0e12
[download]

At first I thought I would do so using Math::Combinatorics with something like this:

use Math::Combinatorics

my @data = ('a', 'b', 'c', 'd', 'e', 'f', 'g');
my $nmer = 3;
my $sampling_density = 2;

my $combinations = Math::Combinatorics->new(
        count   =>      $nmer,
        data    =>      [ @data ],
        );
my $to_skip = 0;

while (my @subset = $combinations->next_combination()) {
        if (0 == $to_skip) {
              $to_skip = int(rand($sampling_density));
              Process(@subset);
              }
        else {
              --$to_skip;
              next();
              }
        }
[download]

This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip.

This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing.

My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees even-ness of the sampling) is impractical are there any other options?

I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or disk-space, which works out to something like 1e3 (1000) GB.

I could also (try to) re-implement the combination-generation portion of Math::Combinatorics to allow for random-sampling without generation of intermediates.

I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158-choose-7 case but I'm wondering if there's a cleaner alternative.

Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics -- preferrably one that doesn't require calculation of the intervening combinations

In reply to Sampling from Combination Space by AdriftOnMemoryBliss

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.