Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Biased random number selection

by athomason (Curate)
on Jul 12, 2000 at 00:11 UTC ( [id://22091]=perlquestion: print w/replies, xml ) Need Help??

athomason has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks-

I need to select a random number from 0 to n, with the caveat that each number should not be selected with equally probability. I have an array (say, @freq, defined from 0..n) containing non-normalized relative frequencies of each index. In other words, the probability of selecting i should come out to be $freq[i]/$sum_over_freq.

The routine is in the middle of a very time-critical loop (I'll eventually need to port this to Fortran on a 128-proc RS/6000, in fact), so speed is the primary concern. I've coded up a few ugly ways to do this, but what I have is all too slow. @freq is approximately static throughout the critical loop, so it might be favorable to generate a lookup within the loop that encases it. Any ideas on how best to implement this?

Replies are listed 'Best First'.
Re: Biased random number selection
by ferrency (Deacon) on Jul 12, 2000 at 00:57 UTC
    If you're choosing between n different values, it's very easy to make an order(n) routine to do this, as someone else has already shown.

    But, you can do better. If you pre-build an array of partial frequency sums, then you have a sorted array, and you can do a binary search on it for the correct number. This would be faster, If n is relatively large, And if the frequency table doesn't change very much.

    # Warning: Untested code, almost definitely has off by one errors! # Given a frequency table in @freq, build a summed frequency table. my $sum; @sf = map {$sum += $_} @freq; # Now we choose a random number in the range of total frequency sum my $target = rand($sum); # To find out what the Answer is (weighted freq from 0 to n), # we now find n such that $sf[n-1]< $target < $sf[n] # We keep track of the numbers we know $n is between my $a = 0; my $b = @sf; my $n; # repeat until we narrow down to one answer while ($a < $b) { $n = int(($a + $b) / 2); if ($sf[$n] < $target) { $a = $n; } elsif ($sf[$n-1] > $target) { $b = $n; } else { # $target < $sf[n] and $target > $sf[n-1], # so we have an answer. last; } } print "I found $n\n";
    Like I said, this is not going to work as-is, but hopefully you get the idea. There is a greater overhead in the loop than doing a linear search, and a greater setup cost to build the array of summed frequencies. But the loop runs in order (log n) time instead of order (n) time, so for sufficiently large n, this can be a great improvement over a straight linear search (use benchmarking to make sure it's a Win).

    If you have a known frequency distribution (ie the frequencies are all almost equal, or there are a lot of low frequencies and a few high ones), you may be able to get Even Faster results if you use some heuristics to choose a better method than "look halfway between my known upper and lower bounds for n." For example, if your frequencies are all pretty equal, you can get a better idea of a guess for $n by seeing how far away $sf[$n] is from $target, and moving $n an appropriate distance, instead of just going halfway.

    Hope this helps. Does anyone know of a good binary search module?

    Alan

      Very cool. The summed frequency array was one of the ideas I implemented, though I shot myself in the foot after that part. I was so determined to squeeze every ounce of speed out that I constructed an array of length $sf[n], with $freq[i] elements with value i for each i. The lookup is O(1), which beats the pants off O(logn) and O(n). Unfortunately, my n actually is fairly large (~10,000), and I ran out of memory pretty quickly (even with 2GB/proc). The O(n) linear search over @sf was going to be my next attempt, but I'll certainly compare it to the binary search.
        Yep, overall it's a typical speed/space tradeoff. Another benefit of the linear or binary traversal solutions is, your frequencies needn't be integers (well, assuming rand() works correctly with non-integers, that is).

Re: Biased random number selection
by btrott (Parson) on Jul 12, 2000 at 00:21 UTC
    This is the algorithm I've seen posted elsewhere (when I did a search a long time ago in the perl newsgroups), and that I've used myself:
    my($tw, $winner) = (0, undef); my @numbers = ( ... ); my @freq = ( ... ); for my $idx (0..$#numbers) { $tw += $freq[ $idx ]; $winner = $idx if rand($tw) < $freq[ $idx ]; } return $winner;
    There's also an example in the Perl Cookbook, apparently, if you have that.
Re: Biased random number selection
by gnat (Beadle) on Jul 12, 2000 at 01:12 UTC
    Here's the code from the Perl Cookbook:

    If you have a list of weights and values you want to randomly pick from, follow this two step process: first, turn the weights into a probability distribution with weight_to_dist below, then use the distribution to randomly pick a value with weighted_rand:

    # weight_to_dist: takes a hash mapping key to weight and returns # a hash mapping key to probability sub weight_to_dist { my %weights = @_; my %dist = (); my $total = 0; my ($key, $weight); local $_; foreach (values %weights) { $total += $_; } while ( ($key, $weight) = each %weights ) { $dist{$key} = $weight/$total; } return %dist; } # weighted_rand: takes a hash mapping key to probability, and # returns the corresponding element sub weighted_rand { my %dist = @_; my ($key, $weight); while (1) { # to avoid floating point inac +curacies my $rand = rand; while ( ($key, $weight) = each %dist ) { return $key if ($rand -= $weight) < 0; } } }
    I hope this helps. I'm not sure how closely it approximates your problem. Let me know if I can help more.

    Nat

Re: Biased random number selection
by Anonymous Monk on Jul 12, 2000 at 06:14 UTC
    There's an algorithm in the cookbook that deals with this problem, and that's already quoted here.

    But, if you are willing to spend the memory, then you might be able to to do it faster. The following algorithm assumes that @freq contains non-negative integers (the algorithm can trivially be extended to non-negative rational numbers). It also assumes you have enough memory for an array of size $N, where $N = do {my $s = 0; $s += $_ for @freq; $s}, and that the frequency doesn't change.

    After preproccessing, the sampling can be done in constant time.

        my @set = map {($_) x $freq $_} 0 .. $#freq;
     
        # Draw a weighted random number.
        sub draw {$set rand @set}  # Now, where did the brackets go???????
    

    -- Abigail

RE: Biased random number selection
by merlyn (Sage) on Jul 12, 2000 at 13:13 UTC
Re: Biased random number selection
by ferrency (Deacon) on Jul 12, 2000 at 20:03 UTC
    I just thought of one more potential optimization if you're going to use a linear search through the list of frequencies.

    If your frequencies are very disparate (some values are much more likely to occur than others), then searching through the most likely results first would yield results faster, on average. To do this, you can sort the frequency list by decreasing frequency, and then search linearly through the list for your target value. In the frequency list you'd need to store n as well as frequency(n), because the index into the frequency list will no longer be equal to n once you sort the list.

    Alan

      If your frequencies are very disparate (some values are much more likely to occur than others), then searching through the most likely results first would yield results faster, on average. To do this, you can sort the frequency list by decreasing frequency, and then search linearly through the list for your target value.
      My gut says that the cost of sorting will far outweigh the cost of finding your desired item, unless you could cache the result of the sort so it's done once, not once per operation.

      My gut has been known to be wrong before though. {grin}

      -- Randal L. Schwartz, Perl hacker

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://22091]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2024-04-23 10:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found