AdriftOnMemoryBliss has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
I hope this is the right place to ask this question -- it may be more algorithmic than language-specific.
Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:
158 choose 5 = ~7.7e8 158 choose 6 = ~2.0e10 158 choose 7 = ~4.3e11 158 choose 8 = ~8.0e12
At first I thought I would do so using Math::Combinatorics with something like this:
use Math::Combinatorics my @data = ('a', 'b', 'c', 'd', 'e', 'f', 'g'); my $nmer = 3; my $sampling_density = 2; my $combinations = Math::Combinatorics->new( count => $nmer, data => [ @data ], ); my $to_skip = 0; while (my @subset = $combinations->next_combination()) { if (0 == $to_skip) { $to_skip = int(rand($sampling_density)); Process(@subset); } else { --$to_skip; next(); } }
This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip.
This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing.
My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees even-ness of the sampling) is impractical are there any other options?
I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or disk-space, which works out to something like 1e3 (1000) GB.
I could also (try to) re-implement the combination-generation portion of Math::Combinatorics to allow for random-sampling without generation of intermediates.
I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158-choose-7 case but I'm wondering if there's a cleaner alternative.
Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics -- preferrably one that doesn't require calculation of the intervening combinations
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Sampling from Combination Space
by blokhead (Monsignor) on Jul 18, 2005 at 05:48 UTC | |
|
Re: Sampling from Combination Space
by Zaxo (Archbishop) on Jul 18, 2005 at 06:02 UTC | |
|
Re: Sampling from Combination Space
by pg (Canon) on Jul 18, 2005 at 05:40 UTC | |
|
Re: Sampling from Combination Space
by tlm (Prior) on Jul 18, 2005 at 12:30 UTC | |
|
Re: Sampling from Combination Space (lcd)
by tye (Sage) on Jul 18, 2005 at 16:04 UTC | |
by abell (Chaplain) on Jul 19, 2005 at 08:48 UTC |