http://qs321.pair.com?node_id=475645


in reply to Sampling from Combination Space

What is generally done to sample large spaces of combinatorial objects is to use a ranking scheme. You create an easy-to-compute, 1-to-1 mapping between a set of objects and natural numbers 1 to N (where N is the number of objects). Then, since it is very easy to uniformly sample from a set of numbers, just do that as your sampling operation and use the ranking scheme to get the object that corresponds to the sampled number.

For these kinds of problems, I always consult my favorite online book, Combinatorial Generation by Frank Ruskey. On page 87 of the PDF, it derives convenient ranking and unranking algorithms for combinations. Here they are in Perl:

use Math::Pari qw[:int binomial]; use Memoize; use List::Util 'sum'; memoize 'binomial'; sub rank { my @comb = sort { $a <=> $b } @_; sum map { binomial( $comb[$_]-1, $_+1 ) } 0 .. $#comb; } # $k is the comination size # $r is the rank sub unrank { my ($k, $r) = @_; my @comb; for my $i (reverse 1 .. $k) { my $p = $i-1; $p++ until binomial($p,$i) > $r; $r -= binomial($p-1, $i); push @comb, $p; } return @comb; }
The rank function takes a combination of size k from {1 .. N} into an integer in {0 .. C(N,k)-1}. The unrank function goes the other direction. In fact, N does not even need to be known to these algorithms (and k only to the unrank algorithm).

Since it makes a lot of use of computing binomial coefficients, I've used Math::Pari's version and also memoized it for speed. However, I haven't gotten the algorithm to work with large Pari numbers (I can't remember how to work with Pari objects, and it's getting late).

Now, assuming you can get these algorithms to work on the large numbers involved, and you can sample from the appropriately large range of integers, you can perform any sampling method you want on {0 .. C(N,k)-1} and apply it to the set of combinations.

....

Having directly addressed your question, let me say that I find it hard to imagine that the naïve sampling approach (repeatedly choosing a uniformly random k-subset) would significantly skew any statistical properties. The sample space is so large and the chance of collision is so small. Are you absolutely sure that all this effort to avoid repeated samples isn't overkill?

blokhead