Hello,

I hope this is the right place to ask this question -- it may be more algorithmic than language-specific.

Okay, what I'm trying to do is to sample as randomly as possible from a very large population of combinations. In particular I'm interested in things like:

158 choose 5 = ~7.7e8 158 choose 6 = ~2.0e10 158 choose 7 = ~4.3e11 158 choose 8 = ~8.0e12

At first I thought I would do so using Math::Combinatorics with something like this:

use Math::Combinatorics my @data = ('a', 'b', 'c', 'd', 'e', 'f', 'g'); my $nmer = 3; my $sampling_density = 2; my $combinations = Math::Combinatorics->new( count => $nmer, data => [ @data ], ); my $to_skip = 0; while (my @subset = $combinations->next_combination()) { if (0 == $to_skip) { $to_skip = int(rand($sampling_density)); Process(@subset); } else { --$to_skip; next(); } }

This approach would skip an average of $sampling_density / 2 combinations (the two comes in because rand should be approximating a uniform distribution), then process a combination and recompute a new number to skip.

This seems viable for things like 158 choose 4 but it will still be impractical for larger populations like 158 choose 8 because even if I generate the combinations at a rate of 10,000 per second I'll still take a decade to get through the combinations, never mind any time taken by processing.

My timing estimations indicate that I can get through about 640,000 records in two weeks, and that's about how long I have to sample each population. I can accept that the sampling is going to be highly sparse, but I would still like to make it as evenly distributed as possible. Given that the above approach (which essentially guarantees even-ness of the sampling) is impractical are there any other options?

I've considered storing all combinations in a matrix/database and randomly sampling from that, but of course that's going to eat up something on the order of 1e12 bytes of RAM or disk-space, which works out to something like 1e3 (1000) GB.

I could also (try to) re-implement the combination-generation portion of Math::Combinatorics to allow for random-sampling without generation of intermediates.

I've also considered randomly sampling n times without replacement from the original 158 and simply ignoring the fact that I may draw the same set of n twice by chance. I am fairly confident this is a realistic assumption, given that my sampling density is on the order of 0.0001% for the 158-choose-7 case but I'm wondering if there's a cleaner alternative.

Is there an easier way to randomly sample from a space of combinations without repetition? Or a more efficient sampling algorithm that the one I proposed above using Math::Combinatorics -- preferrably one that doesn't require calculation of the intervening combinations


In reply to Sampling from Combination Space by AdriftOnMemoryBliss

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.