Drawing samples from a set

crenz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Drawing samples from a set by swiftone (Curate) on Jan 28, 2003 at 15:44 UTC
If your data set is hundreds of items (i.e. not thousands), you don't really have an efficiency problem. Select $r, then start through your list, adding the probability of each item to a total until the total > $r, at which point you have your choice. Linear time. For N items, worst case is N checks, average case N/2. I'm not a math major, but I don't really see an alternative if your probabilities are not predictable in advance. If they are, you can tackle an algorithm without going through the list, but you imply this is not the case. Here is one idea though. This is off the top of my head, doubtless it needs some refinement: If your probabilities don't change often, you could pre-sort the list into "chunks" of recorded sizes (but near a target, frex 0.1) which would make your select time near constant to select the correct chunk, then linear inside that chunk (where you have notably fewer items). If your probabilities change often though, this sorting may not be worthwhile. Example: a b c d e f g h i j with probabilities 0.02, 0.05, 0.20, 0.11, 0.1, 0.14, 0.08, 0.08, 0.09, 0.11) We can sort this into a data structure like: `[ { Total => 0.07, Set => [ { a=> 0.02, b => 0.07}] }, { Total => 0.27, Set => [ { c => 0.20}] }, { Total => 0.27, #repeat since it span's two "subsets" Set => [ { c=> 0.20 }] }, { Total => 0.38, Set => [ { d => 0.11 } ] }, #etc ]` [download] So the end result is that you can examine `$sorted->[int($r*10)]`, move up down inside sorted (usually no more than one element, depending on the differences between your probablities), then go through the subset in that chunk linearly. But again, this sorting will be much more expensive than a single linear search, so you'll have to decide if it's worth it.	[reply] [d/l] [select]
Re: Re: Drawing samples from a set by John M. Dlugosz (Monsignor) on Jan 28, 2003 at 17:10 UTC
If he creates another list of n items listing the cumulative probability to that point, then he can do a binary search. That's more general then the pre-computed "chunks".	[reply]
Re: Drawing samples from a set by rdfield (Priest) on Jan 28, 2003 at 15:20 UTC
Not a completely serious answer, but this module made me laugh, and may be of use, e.g. `use vague; print some of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; print nearly all of "And did those feet in ancient times walk upon Eng +land's mountains green."; print hardly any of "And did those feet in ancient times walk upon Eng +land's mountains green.";` [download] rdfield	[reply] [d/l]
Re: Drawing samples from a set by traveler (Parson) on Jan 28, 2003 at 16:06 UTC
I have not yet used it, but Algorthm::Evolutionary::Wheel sounds a lot like what you are looking for. (I looked at it for a similar application, but found a different solution). HTH, --traveler	[reply]
Re: Drawing samples from a set by waswas-fng (Curate) on Jan 28, 2003 at 15:34 UTC
you could generate a weighted list and random a number that is in the range of 0 - (n-1) where n is the max index. This aproch has some overhead of generating the array each change of the probabilities but is efficiant in doing the lookup. in your example the array would contain 1,1,2,2,2,3,3,3,3,3 and the random number generated would be 0 - 9 when used as an index it would give you a weighted chance to pull a 3 50% of the time. -Waswas	[reply]
Re: Drawing samples from a set by tall_man (Parson) on Jan 28, 2003 at 16:29 UTC
Sounds to me like you could create a Huffman encoding tree for your probabilities (see Algorithm::Huffman). Take one branch or the other with the probability of the sum of the probabilities in that branch. The binary tree will lead to the most frequent choices being found faster.	[reply]
Re: Drawing samples from a set by blokhead (Monsignor) on Jan 28, 2003 at 17:39 UTC
My Math 378 professor actually just touched on this algorithm the other day: Take the sum of all the relative weights and call it N Take a random number between 0 and 1, and multiply it by N. Call the new number X. Iterating through each element of the list do the following: Subtract the weight of the element from X. If X went negative, you are sitting on the chosen element, so quit. Otherwise, move to the next element. Works nicely on both real and integer weights, and doesn't need extra storage to fill up a normalized array with duplicates. Also, the relative weights don't need to be normalized to probabilities. blokhead	[reply]
Re: Drawing samples from a set by artist (Parson) on Jan 28, 2003 at 15:30 UTC
Not much optimization here, but provides a good starting point who want to take it further. `my $hash = { 1 => 0.2, 2 => 0.3, 3 => 0.5 }; my $sum; my $hash2; while(my($key,$value) = each %{$hash}){ $sum += $value; print "$sum . $key\n"; $hash2->{$sum} = $key; } foreach (1..10){ my $x = rand(1); print "$x\t"; foreach (sort { $a <=> $b } keys %{$hash2}){ if($x < $_){ print $hash2->{$_},"\n"; last; } } }` [download] artist	[reply] [d/l]
Re: Drawing samples from a set by no_slogan (Deacon) on Mar 14, 2003 at 00:28 UTC
I gather that you're interested in sampling "without replacement" as they say, so the same item can't be drawn twice. If you're only sampling a few items out of hundreds, it may be best to simply reject duplicates. Keep a hash of items you've already selected, and loop until you get a new one. If the probability distribution is highly skewed, or you're taking a large sample, then you're likely to waste a lot of time picking duplicates with this scheme. The problem is -- this is not very efficient, especially if you have hundreds of items whose probabilities change all the time. It's often not necessary to actually compute the probability of each item; you can simply work with weights that are proportional to the probability. An item can be effectively removed from the set by setting its weight to zero. The other weights don't need to change. This thread has a number of good suggestions for the case where the weights do not change. However, if even one of them changes, everything has to be re-examined to update the data structure, at a cost of O(n). I have a solution below which stores the weights in a balanced binary tree instead. Only the ancestors in the tree need to be visited when something is changed, so the cost is O(log n) for a single change. Choosing a random element is O(log n); choosing a sample of k of them is O(k*log n). The code is not terribly pretty, but I think it's fairly efficient. Only the tree-building routine is recursive. Read more... (3 kB)	[reply] [d/l]
Re: Drawing samples from a set by jdporter (Paladin) on Jan 28, 2003 at 19:30 UTC
Something very similar to this was discussed before; you might something useful in this thread. jdporter The 6th Rule of Perl Club is -- There is no Rule #6.	[reply]


Clear questions and runnable code get the best and fastest answer
	PerlMonks