Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Drawing samples from a set

by crenz (Priest)
on Jan 28, 2003 at 15:03 UTC ( [id://230605]=perlquestion: print w/replies, xml ) Need Help??

crenz has asked for the wisdom of the Perl Monks concerning the following question:

For a certain project, I need to draw samples from a set of a few hundred items. I know about Abigail's Sample.pm, but it is not applicable for this problem: Every item has a probability attached.

My basic idea to do is is to add all items up. Suppose we have three items (1, 2, 3) with probabilites (0.2, 0.3, 0.5). The probabilites add up to 1, and I could just get a random number $r between 0 and 1. If $r < 0.2, then the selection is 1, else if $r < 0.5, the selection is 2, else the selection is 3. The problem is -- this is not very efficient, especially if you have hundreds of items whose probabilities change all the time.

Does anybody have any ideas on how to optimize this?

Replies are listed 'Best First'.
Re: Drawing samples from a set
by swiftone (Curate) on Jan 28, 2003 at 15:44 UTC
    If your data set is hundreds of items (i.e. not thousands), you don't really have an efficiency problem. Select $r, then start through your list, adding the probability of each item to a total until the total > $r, at which point you have your choice. Linear time. For N items, worst case is N checks, average case N/2.

    I'm not a math major, but I don't really see an alternative if your probabilities are not predictable in advance. If they are, you can tackle an algorithm without going through the list, but you imply this is not the case.

    Here is one idea though. This is off the top of my head, doubtless it needs some refinement:

    If your probabilities don't change often, you could pre-sort the list into "chunks" of recorded sizes (but near a target, frex 0.1) which would make your select time near constant to select the correct chunk, then linear inside that chunk (where you have notably fewer items). If your probabilities change often though, this sorting may not be worthwhile.

    Example: a b c d e f g h i j with probabilities 0.02, 0.05, 0.20, 0.11, 0.1, 0.14, 0.08, 0.08, 0.09, 0.11)

    We can sort this into a data structure like:

    [ { Total => 0.07, Set => [ { a=> 0.02, b => 0.07}] }, { Total => 0.27, Set => [ { c => 0.20}] }, { Total => 0.27, #repeat since it span's two "subsets" Set => [ { c=> 0.20 }] }, { Total => 0.38, Set => [ { d => 0.11 } ] }, #etc ]
    So the end result is that you can examine $sorted->[int($r*10)], move up down inside sorted (usually no more than one element, depending on the differences between your probablities), then go through the subset in that chunk linearly.

    But again, this sorting will be much more expensive than a single linear search, so you'll have to decide if it's worth it.

      If he creates another list of n items listing the cumulative probability to that point, then he can do a binary search. That's more general then the pre-computed "chunks".
Re: Drawing samples from a set
by rdfield (Priest) on Jan 28, 2003 at 15:20 UTC
    Not a completely serious answer, but this module made me laugh, and may be of use, e.g.
    use vague; print some of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; print nearly all of "And did those feet in ancient times walk upon Eng +land's mountains green."; print hardly any of "And did those feet in ancient times walk upon Eng +land's mountains green.";

    rdfield

Re: Drawing samples from a set
by traveler (Parson) on Jan 28, 2003 at 16:06 UTC
    I have not yet used it, but Algorthm::Evolutionary::Wheel sounds a lot like what you are looking for. (I looked at it for a similar application, but found a different solution).

    HTH, --traveler

Re: Drawing samples from a set
by waswas-fng (Curate) on Jan 28, 2003 at 15:34 UTC
    you could generate a weighted list and random a number that is in the range of 0 - (n-1) where n is the max index. This aproch has some overhead of generating the array each change of the probabilities but is efficiant in doing the lookup. in your example the array would contain 1,1,2,2,2,3,3,3,3,3 and the random number generated would be 0 - 9 when used as an index it would give you a weighted chance to pull a 3 50% of the time.

    -Waswas
Re: Drawing samples from a set
by tall_man (Parson) on Jan 28, 2003 at 16:29 UTC
    Sounds to me like you could create a Huffman encoding tree for your probabilities (see Algorithm::Huffman). Take one branch or the other with the probability of the sum of the probabilities in that branch. The binary tree will lead to the most frequent choices being found faster.
Re: Drawing samples from a set
by blokhead (Monsignor) on Jan 28, 2003 at 17:39 UTC
    My Math 378 professor actually just touched on this algorithm the other day:
    1. Take the sum of all the relative weights and call it N
    2. Take a random number between 0 and 1, and multiply it by N. Call the new number X.
    3. Iterating through each element of the list do the following:
    4. Subtract the weight of the element from X.
    5. If X went negative, you are sitting on the chosen element, so quit.
    6. Otherwise, move to the next element.
    Works nicely on both real and integer weights, and doesn't need extra storage to fill up a normalized array with duplicates. Also, the relative weights don't need to be normalized to probabilities.

    blokhead

Re: Drawing samples from a set
by artist (Parson) on Jan 28, 2003 at 15:30 UTC
    Not much optimization here, but provides a good starting point who want to take it further.
    my $hash = { 1 => 0.2, 2 => 0.3, 3 => 0.5 }; my $sum; my $hash2; while(my($key,$value) = each %{$hash}){ $sum += $value; print "$sum . $key\n"; $hash2->{$sum} = $key; } foreach (1..10){ my $x = rand(1); print "$x\t"; foreach (sort { $a <=> $b } keys %{$hash2}){ if($x < $_){ print $hash2->{$_},"\n"; last; } } }
    artist
Re: Drawing samples from a set
by no_slogan (Deacon) on Mar 14, 2003 at 00:28 UTC

    I gather that you're interested in sampling "without replacement" as they say, so the same item can't be drawn twice. If you're only sampling a few items out of hundreds, it may be best to simply reject duplicates. Keep a hash of items you've already selected, and loop until you get a new one. If the probability distribution is highly skewed, or you're taking a large sample, then you're likely to waste a lot of time picking duplicates with this scheme.

    The problem is -- this is not very efficient, especially if you have hundreds of items whose probabilities change all the time.

    It's often not necessary to actually compute the probability of each item; you can simply work with weights that are proportional to the probability. An item can be effectively removed from the set by setting its weight to zero. The other weights don't need to change.

    This thread has a number of good suggestions for the case where the weights do not change. However, if even one of them changes, everything has to be re-examined to update the data structure, at a cost of O(n). I have a solution below which stores the weights in a balanced binary tree instead. Only the ancestors in the tree need to be visited when something is changed, so the cost is O(log n) for a single change. Choosing a random element is O(log n); choosing a sample of k of them is O(k*log n). The code is not terribly pretty, but I think it's fairly efficient. Only the tree-building routine is recursive.

Re: Drawing samples from a set
by jdporter (Paladin) on Jan 28, 2003 at 19:30 UTC
    Something very similar to this was discussed before; you might something useful in this thread.

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://230605]
Approved by valdez
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2024-04-23 14:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found