http://qs321.pair.com?node_id=436861

I took calculus as part of my electrical engineering degree. The irony of majoring in EE is that I'm not that hot at math, at least abstract math. I like hard numbers that represent real things. The professor explained it like this: he graphed a function on the board and started working out how to find the area below the curve. As an aside, he explained that a chemist, confronted with the same problem, would simply draw the graph out 10 times and weigh the resulting graphs to arrive at the answer. Depending on the accuracy needed, that chemist might draw and weigh the graph 100 or 1,000 times. After he explained this, I thought to myself, 'Crap, I guess that means I should have gone with chemistry.'

Recently, I came up against a real-life situation involving complex curves and the area under them. I dutifully combed the internet and dug out my old probability and statistics textbook to figure out the answer to the question, "How do I find the union of two normal distributions?" (The normal distribution is the famous "bell curve".)

I found many, many texts, some of them starting out with just the basics and working their way up. I found lots of complex formulas that did not do quite what I was looking for, formulas that looked like they were written in Sanskrit. After 5 or 6 hours of frustration, I decided to look at the problem from a chemist's point of view: empirically.

Empirically comparing normal distributions: a normal distribution is a graph of probabilities, really just an array. Perl is great for handling this stuff, so I wrote a method that converts a mean and standard deviation into an array plotting out what a population of 100 would look like. I did this again for the second mean and SD and then compared the results. If a point appeared in both arrays, it was counted as common. Then a simple look at the ratio of 'common' points to the population and presto! A simple percentage of similarity!
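To make that concrete with made-up numbers (a toy version of the idea, not my real data): give each curve a population of 10 spread over five x positions, count the population the two curves have in common at each position, and take the ratio.

    # Toy illustration with made-up numbers. A point is 'common' where
    # both curves have population at that x position, i.e. the smaller
    # of the two counts.
    my @curve_a = ( 1, 3, 4, 2, 0 );    # population of 10
    my @curve_b = ( 0, 2, 4, 3, 1 );    # population of 10

    my ( $common, $population ) = ( 0, 0 );
    for my $i ( 0 .. $#curve_a ) {
        $common += $curve_a[$i] < $curve_b[$i] ? $curve_a[$i] : $curve_b[$i];
        $population += $curve_a[$i];
    }
    printf "%.0f%% similar\n", 100 * $common / $population;    # 80% here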

As an added bonus, I could increase the population to arbitrarily high numbers (1,000 or 10,000 points) to get better resolution on the graphs and hence more accurate percentages. In effect, I would make my script plot out the hundreds of graphs and weigh them for me.

A quick search found me a GPL'd method for finding the percentage of a population at a datapoint based on the mean and SD:

sub int_gen_curve {
    my ( $self, $x, $mean, $sdev ) = @_;
    my $pi = 3.14159265358979323844;
    return 1 / ( sqrt( 2 * $pi * $sdev * $sdev ) )
         * exp( -( $x - $mean ) * ( $x - $mean ) / ( 2 * $sdev * $sdev ) );
}
Where '$x' is a point on the x-axis.
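As a quick sanity check of that formula (my addition, not part of the GPL'd snippet), the density of a standard normal, mean 0 and SD 1, at its own mean should come out near 0.3989:

    # 1/sqrt(2*pi) is roughly 0.3989; the sub is written as a method,
    # so 'main' stands in for $self when it lives in package main.
    printf "density at the mean: %.4f\n", main->int_gen_curve( 0, 0, 1 );

Next, I wrote a simple routine for turning my mean and standard deviation into an array: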
sub compare_bell_curves {
    my ( $self, $m1, $sd1, $m2, $sd2 ) = @_;

    my $upperbound = sprintf( "%.1f", ( $m1 + ( 1.75 * $sd1 ) ) );
    my $lowerbound = sprintf( "%.1f", ( $m1 - ( 1.75 * $sd1 ) ) );
    my $area       = 10000;
    my $data;

    for ( my $x = $lowerbound; $x < $upperbound; $x = $x + 1 ) {
        $x = sprintf( "%.0f", $x );
        my $posarg = $self->int_gen_curve( $x, $m1, $sd1 );
        $data->[$x] = int( $area * $posarg );
    }

    # do the above again for $m2 and $sd2
    # then subtract $data1 from $data; any $data->[$x]
    # that remains positive is a point of difference
}
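Fleshed out, those last two comments amount to something like the following. This is a sketch rather than my exact code: it goes inside the sub before its closing brace, reuses the same bounds for the second curve, and assumes those bounds are non-negative so the array indices behave.

    # Sketch: build the second curve the same way, into $data1 ...
    my $data1;
    for ( my $x = $lowerbound; $x < $upperbound; $x = $x + 1 ) {
        $x = sprintf( "%.0f", $x );
        $data1->[$x] = int( $area * $self->int_gen_curve( $x, $m2, $sd2 ) );
    }

    # ... then subtract it from $data, point by point. Whatever stays
    # positive is population that curve 2 could not account for.
    for my $x ( 0 .. $#{$data} ) {
        $data->[$x] = ( $data->[$x] || 0 ) - ( $data1->[$x] || 0 );
    }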
In the end, I just had to count the positive points left in $data after the subtraction and work out their percentage relationship to the population. If there were NO positive points ($data1 removed all of the points from $data), I would have two equal distributions.
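Turning that count into a percentage is then just a couple of lines. Here I treat $area (the 10,000-point population) as the denominator, which is my reading of it rather than a transcript of the original script:

    # Sketch: whatever stayed positive after the subtraction is the
    # 'difference'; everything else was common to both curves.
    my $difference = 0;
    for my $x ( 0 .. $#{$data} ) {
        $difference += $data->[$x] if ( $data->[$x] || 0 ) > 0;
    }
    my $similarity = 100 * ( $area - $difference ) / $area;
    printf "the two distributions are %.1f%% similar\n", $similarity;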

Perl allowed me to reduce a quite complex problem to its basic elements and solve it with real numbers, empirically. As an added bonus, I had datasets that could easily be plugged into GD::Graph to give an extra, visual representation of what the data said.
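For the visual part, the arrays drop straight into GD::Graph. A minimal sketch, assuming you kept copies of the two curve arrays from before the subtraction (the sub name and output file are mine, just for illustration):

    use GD::Graph::lines;

    # Sketch: plot two curve arrays (array refs of point counts indexed
    # by x position) on one graph, and write the result out as a PNG.
    sub plot_curves {
        my ( $curve1, $curve2, $file ) = @_;

        my @x  = ( 0 .. $#{$curve1} );
        my @y1 = map { $curve1->[$_] || 0 } @x;
        my @y2 = map { $curve2->[$_] || 0 } @x;

        my $graph = GD::Graph::lines->new( 600, 400 );
        $graph->set(
            x_label => 'x',
            y_label => 'points of population',
        ) or die $graph->error;

        my $gd = $graph->plot( [ \@x, \@y1, \@y2 ] ) or die $graph->error;

        open my $png, '>', $file or die "can't write $file: $!";
        binmode $png;
        print {$png} $gd->png;
        close $png;
    }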

Does anybody else think like this, or am I just kooky?