Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Probability sum of random variables

by FFRANK (Beadle)
on Aug 30, 2007 at 19:44 UTC ( [id://636186]=perlquestion: print w/replies, xml ) Need Help??

FFRANK has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Need to calculate the probability that the sum of random variables equals some value.

Training set: Number of objects (e.g. arrays of numbers), calculate the frequency that the sum of the object equals 0, 1, 2..max.

Probability: Find the probability that the sum of a number of random variables equals some number (between 0 and max), given those frequencies from the training set.

Example: After sampling (X) objects, the following frequencies have been found from the training set:
Freq(sum=0) = 0.1
Freq(sum=1) = 0.2
Freq(sum=2) = 0.7

What is the probability that the sum of 2 elements = 4? This code does the job:

#!/usr/bin/perl -w use strict; use Algorithm::Loops qw(NestedLoops); use List::Util qw(sum); #Set no. of elements, frequencies and the sum value my $nElements = 2; my @freq = qw (0.1 0.2 0.7); my $targetSum = 4; #Ways of generating sum given no. of elements my @pos = (0..$#freq); my (@combo, @matrix); for my $c (0..$nElements-1) { push @combo, [@pos]; } my @nest = NestedLoops(\@combo, sub { [ @_ ] } ); for my $n (0..$#nest) { push @{$matrix[sum @{$nest[$n]}]}, join ('', @{$nest[$n]}); } my @ways = @{$matrix[$targetSum]}; #Now calculate P-value my @pVal; foreach my $w (0..$#ways) { my @f = split ('', $ways[$w]); #print "@f\n"; my $p = $freq[$f[0]]; #print $p,"\n"; for my $g (1..$#f) { $p *= $freq[$f[$g]]; } push @pVal, $p; } my $pVal = sum @pVal; print "P-value(sum of $nElements = $targetSum) = $pVal\n"; #Prints: P-value(sum of 2 = 4) = 0.49
I have much LARGER SUMS to evaluate that derive from larger number of elements (thousands), with MORE FREQUENCIES (up to 10) and doubt that this code will ever scale up well.

Is there a module that can be used for this ? Or is there some way to do this that would scale up nicely. Convolutions ? FFT ? I am no mathematician, if you may please provide details.

Thank you very much, best.

FFRANK

Replies are listed 'Best First'.
Re: Probability sum of random variables
by Joost (Canon) on Aug 30, 2007 at 20:58 UTC
    I am not a mathematician, but I *think* the question can be solved without any training set if you know "how random" the random numbers are (i.e. what's the distribution)?

    For instance, it's easy to see that if the numbers are evenly distributed between 0 and 1, for N random numbers N / 2 would be have the highest frequency, while 0 and N would have the lowest.

    Sorry for derailing this thread if you've got good reason to need the training set - just a thought.

      Hi Joost, The training set isn't a problem to compute, and necessary to know what is the probability of obtaining any given sum. The problem is more on how to scale up the recursive convolution. FFRANK
Re: Probability sum of random variables
by Cristoforo (Curate) on Aug 31, 2007 at 01:02 UTC
    Although I can't install the module Algorithm::Loops, I believe if I'm following correctly (after reading up on this module on CPAN) that the sums you have set up are:
    0 + 0 = 0 0 + 1 = 1 0 + 2 = 2 1 + 0 = 1 1 + 1 = 2 1 + 2 = 3 2 + 0 = 2 2 + 1 = 3 2 + 2 = 4
    Then, when you calculate the p value for the sucessful combination (2 + 2 is the only one to equal 4), you've mutiplied the probability of a 2 (= .7) times the probability for another 2 and get the answer (.49). If there had been other combs that equaled 4, their probability would have been calculated and added to the .49.

    I guess I am wondering if the sum to be gotten was 2, for example, then 3 cases would qualify with your algorithm.

    0 + 2 = 2 1 + 1 = 2 2 + 0 = 2
    But, do you really want to count (0, 2) *and* (2, 0) because they are the same elements (from the original array), only one is the reverse of the other. Also, do you want to include the same item added to itself (1 + 1)?

    The numbers you are generating now are also known as the cross product of n sets. Perhaps you just want to test all the possible combinations instead?

    0 + 1 = 1 0 + 2 = 2 1 + 2 = 3

    As far as your code scaling up, it depends on how large the number of trials. If you use the cross product, with 10,000 numbers there will be 10,000 * 10,000 trials. On the other hand, if you tested the combinations there would be 10,000 * 9,999 / 2 trials - half as many.

    Chris

      Hi Cristoforo, you getting the point right. If, for example, only three "object sums" are possible (0, 1 and 2).

      $nElements = 1: Sum = 0 -> 0 Sum = 1 -> 1 Sum = 2 -> 2 $nElements = 2: Sum = 0 -> 0+0 Sum = 1 -> 0+1 1+0 Sum = 2 -> 0+2 1+1 2+0 Sum = 3 -> 1+2 2+1 Sum = 4 -> 2+2 $nElements = 3: Sum = 0 -> 0+0+0 Sum = 1 -> 0+0+1 0+1+0 1+0+0 Sum = 2 -> 0+0+2 0+1+1 0+2+0 1+0+1 1+1+0 2+0+0 Sum = 3 -> 0+1+2 0+2+1 1+0+2 1+1+1 1+2+0 2+0+1 2+1+0 Sum = 4 -> 0+2+2 1+1+2 1+2+1 2+0+2 2+1+1 2+2+0 Sum = 5 -> 1+2+2 2+1+2 2+2+1 Sum = 6 -> 2+2+2 And so on...
      Of course, there is some redundancy in there, for ex. Element = 3, sum = 1 could be treated as (3*((p0)**2)*p1). But, how to generate this automatically and fast ?

      Thanks, FFRANK

Re: Probability sum of random variables
by dwm042 (Priest) on Aug 30, 2007 at 20:10 UTC
Re: Probability sum of random variables
by dwm042 (Priest) on Aug 31, 2007 at 02:39 UTC
    FFRank,

    I'll simply point out in the case where there are only two states (0 and 1), your calculations devolve into calculations of binomials. There are formulas for these, they can be exactly calculated without brute forcing them. What you're looking at I'd call a multinomial formula, a "generalized" binomial formula. The formula probably can be found in a decent mathematical physics text or a text on statistical mechanics, or a higher end probability text. For that matter, the multinomial formula is given here.

    This of course means what you're dealing with are factorial calculations as you scale up. I think you'll lose the ability to get exact answers rapidly, though the logarithmic form of Stirling's formula may allow you approximate answers on the high end of things.

    David

    Update: More exact language. Binomial coefficients aren't binomial distributions. Added a link to a multinomial formula.
      Hi David, I believe the multinomial formula does not apply here, as we deal with the probability of a sum given n trials (over the combinations of outcomes that can generate the sum over n trial).

      Given: number of trials = t, sum of random variables = Z, the discrete convolution is:

      Pr[Yt = Z] = ∑ (e = 0 .. Z) Pr[Yt-1 = Z-e] * Pr[e]

      FFRANK

Re: Probability sum of random variables
by Cristoforo (Curate) on Aug 31, 2007 at 01:49 UTC
    Using the Set::CrossProduct module, here is another approach to your present solution.
    #!/usr/bin/perl use strict; use warnings; use Set::CrossProduct; use List::Util 'sum'; my $targetSum = 4; my @freq = qw (0.1 0.2 0.7); my $nElements = 2; my @combo = ([0..$#freq]) x $nElements; my $cross = Set::CrossProduct->new( \@combo ); my @ways; while( my $array_ref = $cross->get ) { push @ways, $array_ref if sum(@$array_ref) == $targetSum; } #Now calculate P-value my $pVal; for my $aref (@ways) { my $p = 1; for my $operand (@$aref) { $p *= $freq[$operand]; } $pVal += $p; } print "P-value(sum of $nElements = $targetSum) = $pVal\n";
Re: Probability sum of random variables
by Prof Vince (Friar) on Aug 31, 2007 at 10:03 UTC
    The distribution of the sum of two independant random variables is (IIRC) the convolution product of the distributions. From that, it's rather simple to compute it :
    use List::Util qw/sum/; my @a; # holds the known distribution sub conv { my ($n) = @_; return sum map { ($_ < @a) ? ((($n - $_) < @a) ? $a[$_] * $a[$n - $_] : 0) : 0 } 0 .. $n; }
    If you want the distribution of the sum of n random variables, you can adapt this code to accept two distributions and recursively compute the n-th convolution product of @a.
Re: Probability sum of random variables
by injunjoel (Priest) on Aug 31, 2007 at 17:12 UTC
    and still no mention of PDL???

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: Probability sum of random variables
by b4swine (Pilgrim) on Sep 02, 2007 at 23:18 UTC
    This should easily do large sums of many random variables. The technique to do large convolutions is called http://en.wikipedia.org/wiki/Exponentiation_by_squaring.

    BTW P-value has a different meaning. You really just want to call it probability.

    Enjoy!

    #!/usr/bin/perl -w use strict; #Set no. of elements, frequencies and the sum value my @freq = qw (0.1 0.2 0.7); my $nElements = 2; my $targetSum = 4; # # If you are only interested in the first few $targetSums # and @freq is a large array, you can speed up the computation # and reduce space requirements by setting $maxSum to the # largest conceivable $targetSum # my $maxSum=400; sub conv { my ($p, $q) = @_; my ($np, $nq) = (scalar @$p, scalar @$q); my @ans = (); push @ans, 0 for (1..$np+$nq); my $ip = 0; for my $vp (@$p) { my $iq = 0; for my $vq (@$q) { $ans[$ip+$iq] += $vp*$vq; $iq++; last if $ip+$iq > $maxSum; } $ip++; } return \@ans; } # see http://en.wikipedia.org/wiki/Exponentiation_by_squaring sub convn { my ($n, $p) = @_; my $ans = [1]; while ($n) { $ans = conv($ans, $p) if $n & 1; # if n is odd $p = conv($p, $p); use integer; $n = $n >> 1; } return $ans; } my $ans = convn($nElements, \@freq); print "Prob(sum of $nElements = $targetSum) = $ans->[$targetSum]\n"; #Prints: Prob(sum of 2 = 4) = 0.49 print convn(200, [0.1, 0.2, 0.3, 0.2, 0.1, 0.1])->[400]; #Prints: 0.000210606620621573
      Hi b4,

      Your solution is great and fast, Thanks very much !

      FFRANK

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://636186]
Approved by kyle
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (2)
As of 2024-04-25 21:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found