http://qs321.pair.com?node_id=588397

willyyam has asked for the wisdom of the Perl Monks concerning the following question:

This is more of a math question than a Perl question, but I know there are monks with far greater math skills than I, and I hope to pick their brains.

Here's the question:

I have the population figures for several entities. What I am looking for is the probability of a given sample of entities occurring at random. Update: As the first response below demonstrates, I have not been sufficiently clear. I'm looking for a the likelyhood of getting a random sample of seven entities (or however many specified) from the total population represented by the population numbers provided. The order of the sample doesn't matter, just that it has those entities in the stipulated proportions. Some pseudocode:

#!/usr/bin/perl use strict; use warnings; # Population numbers my $entity1 = 35000 my $entity2 = 41000 my $entity3 = 16000 my $entity4 = 18000 my $entity5 = 21000 my $entity6 = 45000 my $entity7 = 27000 my $entity8 = 10000 my $entity9 = 16000 # Sample constituents my @sample = ($entity9,$entity9,$entity4,$entity2,$entity6,$entity1,$e +ntity1) sub probability { # Return the probability of @sample occurring at random based on the # populations declared }

I would be perfectly happy if someone could show me how to do the math, and I can do the Perl, although seeing how to do the math in Perl would be great too. Thanks.

Replies are listed 'Best First'.
Re: Sample Probabilities
by blokhead (Monsignor) on Dec 07, 2006 at 18:15 UTC
    Just to be sure I understand you want ... You have 35 red widgets, 41 green widgets, 16 blue widgets, and you want to know the probability of getting 2 red, 1 green, 0 blue, if you were chosing 3 items randomly from all widgets? Also, the order in which they were drawn is irrelevant (drawing green,red,red is the same as red,green,red), and you are drawing without replacement.

    Continuing with that example, the probability of getting green,red,red, in that order, is:

    (41/92)*(35/91)*(34/90)
    All we have to do now is multiply this quantity by the number of ways to rearrange (green,red,red). In this case, there are 3 ways to rearrange them.

    In general, the probability of getting x1 red, x2 green, x3 blue, etc, in some particular order, is the following: (I separated out the numerator and denominator to make it easier)

    (#red * (#red-1) * ... (#red-x1+1)) * (#green * (#green-1) * ... (#green-x2+1) * ... / ( total * (total-1) * (total-2) * ... )
    And the number of ways of reordering a list of x1 reds, x2 greens, x3 blues, is
    (x1+x2+x3+ ...)! / (x1! * x2! * x3! * ... )
    Those exclamation points are for factorials.

    As perl code, and using the example in your original post (I changed the indexing to be 0-based):

    use List::Util 'sum'; sub fact { my $n = shift; my $f = 1; $f *= $_ for 1 .. $n; $f; } ## $n * ($n-1) * ... * ($n-$m+1) sub lower { my ($n, $m) = @_; my $result = 1; $result *= $n-$_ for 0 .. $m-1; $result; } sub prob { my ($sizes, $sample) = @_; my $n = @$sizes; ## how many types of items my $S = sum @$sizes; ## total number of items in the universe my @sampled = (0) x $n; my $samplesize = @$sample; $sampled[$_]++ for @$sample; ## $sampled[x] = how many of type x ## are in this sample? my $total = fact($samplesize) / lower($S, $samplesize); $total *= lower($sizes->[$_], $sampled[$_]) / fact($sampled[$_]) for 0 .. $n-1; $total; } ## how many ways to get 2 of type 8, 1 of type 3, etc.. ? print prob( [35000,41000,16000,18000,21000,45000,27000,10000,16000], [8,8,3,1,5,0,0] ); ## output: 0.000397344651946416
    Disclaimer: This code not tested in any great detail. Also keep in mind that all the probabilities will look small, because there are so many different events in this distribution.

    Update: updated computation to account for drawing without replacement. Old code is still in comment tags. The probabilities hardly change at all since the domain is so large.

    Update: fixed copy & paste error, thanks to BrowserUk++.

    blokhead

      There appears to have been a C&P error in your code. This line is missing a ;

      $total *= lower($sizes->[$_], $sampled[$_]) / fact($sampled[$_])

      and by implication of your use of $_, also a for modifier?

      Should that be?

      $total *= lower($sizes->[$_], $sampled[$_]) / fact($sampled[$_]) for + 0..$n-1;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      Outstanding! Thank you. It's been a long time since Finite Math in High School, and I just couldn't remember how to approach this. The problem with Google and most probability tutorials is that you need to know what things are called. Thanks again.

Re: Sample Probabilities
by davido (Cardinal) on Dec 07, 2006 at 17:28 UTC

    First, your code would probably be cleaner like this (avoiding the use of variables like $entity1 .. $entity9; that's what arrays are for):

    use strict; use warnings; my @entity = ( 35000, 41000, 16000, 18000, 21000, 45000, 27000, 10000, 16000, ); my( @sample ) = @entity{ 8, 8, 3, 1, 5, 0, 0 }; my $prob = probability( \@entity, \@sample );

    For your probability data, consider each "draw" from @entity to be similar to a roll of the dice. Each roll has equal odds of hitting a given number, and a number can be hit more than once. That means the probability of getting a particular sequence should be equally improbable no matter what the sequence happens to be. It is just as improbable of rolling from slot 1, 2, 3, 4, 5 as it is to roll from slot 1, 1, 1, 1, 1. The probability, therefore, is based on the number of rolls and the number of "sides" (number of entities), not the sequence itself.

    On the other hand, if you're gathering a total based on the draws, the probability of totals will fall along a bell curve. This is something that D&D players know well: If you roll three six-sided dice, your odds of getting a 3 or an 18 are less than your odds of getting an 11. That's because there are more combinations that could net you the sum of 11 , but only one combination (all sixes) will get you a sum of 18. There's an article about this at wikipedia here: Dice.

    One thing to remember is that your containers are "ever-full". When you draw (or roll), you're not removing, you're just looking and putting it back. At least that's the way your sample code seems to be handling things.

    Your question I believe is about the probability of hitting a particular sequence. If you have nine entities, each time you draw you have a 1/9th chance of getting a particular one. Your chances of getting a particular sequence of seven draws from a nine-slot bucket is (1/9)^^7.

    That means that sub probability() can be as simple as this:

    sub probability { my $entity_count = @{$_[0]}; my $sample_count = @{$_[1]}; return( ( 1 / $entity_count ) ^^ $sample_count ); }


    Dave