PerlMonks

### A lesson in statistics

by 0xbeef (Hermit) on Mar 19, 2007 at 20:33 UTC

0xbeef has asked for the wisdom of the Perl Monks concerning the following question:

Dear Stat Monks,

I am a fool, and strictly speaking this is not a Perl problem but a lack of statistical knowledge. I apologise for that... I present the following simplest of problems:

vmstat 1 10 extract:

```
po  fr
0   0
0   0
150 10
0   0
0   0
```

I wish to calculate the po:fr ratio for this. If it exceeds 15:1, make some printf noise. My current solution is:

```
use List::Util qw(sum);

my $tlsamples = @series_po;        # same length as @series_fr
return 0 if $tlsamples == 0;
my $sum_po = sum(@series_po);      # = 150
my $sum_fr = sum(@series_fr);      # = 10

$sum_fr = 1 if $sum_fr == 0;
my $avg_po = $sum_po / $tlsamples; # = 150 / 5 = 30
my $avg_fr = $sum_fr / $tlsamples; # = 10 / 5 = 2

$avg_fr = 1 if $avg_fr == 0;       # avoid div/0
my $pofr = $avg_po / $avg_fr;      # = 15
```

This result of 15:1 is the same as for the following series:

```
po  fr
150 10
```
The problem is, I need the zeroes to be significant in the first series, since they are. A single value spike should not be able to cause an alert, given many other zero values! (where 0 = no activity in vmstat context)

I have zero (pun intended) statistical background. I have thought of substituting each zero value with its nearest least-significant alternative, e.g.

```
po  fr
150 10
1   1
1   1
1   1
1   1
```

In this case the ratio works out to (154/5) / (14/5) = 11. Is there a correct statistical perl-friendly approach that provides significance to the zeroes in the series?

Niel
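One statistically simple way to make the zeros significant is to average the per-sample ratios rather than taking the ratio of the sums; each idle sample then contributes a ratio of 0 and dilutes a lone spike. A minimal sketch using the series above (treating fr == 0 samples as a ratio of 0 is an assumption, not something from the post):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# The series from the post: zeros represent idle samples.
my @series_po = (0, 0, 150, 0, 0);
my @series_fr = (0, 0, 10,  0, 0);

# Ratio of sums: the zeros vanish and the single spike dominates.
my $ratio_of_sums = sum(@series_po) / (sum(@series_fr) || 1);   # 150/10 = 15

# Mean of per-sample ratios: idle samples contribute 0 and dilute the spike.
my @ratios = map { $series_fr[$_] ? $series_po[$_] / $series_fr[$_] : 0 }
             0 .. $#series_po;
my $mean_of_ratios = sum(@ratios) / @ratios;                    # 15/5 = 3
```

Here the spike alone yields 15, but averaged over the five samples the result drops to 3, well under the 15:1 threshold.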

Re: A lesson in statistics (no, specs)
by tye (Sage) on Mar 20, 2007 at 01:47 UTC

For context:

```
po  Pages paged out
fr  Pages freed per second
```

Since po=1 and fr=0 is more than a million times "worse" than your "15 times" threshold and yet I really doubt it represents a situation that you want to be worried about, I think your "15 times" criteria is not enough.

Your problem sample data shows samples where every single sample has fr <= 15*po so, of course, it fires the "15 times" alarm. That problem is more with your choice of alarm criteria than with your arithmetic.

If your "15 times" does a good job even for quite large values (it certainly doesn't for very small values), then perhaps you just need to add a minimum criterion. Forcing fr=1 as a minimum is a fine way of saying that po < 15 is never alarming.

So if po stays at 14 for many samples while fr stays at 0 for many samples, is that indicative of a problem? It goes off the scale for your stated "15 times" criteria. But it never reaches the criteria if you set a minimum of 1 for fr.

Is po=300,fr=15 really much more worrying than po=3000,fr=250 ?

So play with some more data and figure out criteria that better represent the situation you are worried about than just "15 times".

- tye
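The minimum-criterion idea above might be sketched like this; `$MIN_PO` is a hypothetical volume floor chosen for illustration, not a value from the thread:

```perl
use strict;
use warnings;

my $RATIO_LIMIT = 15;    # the "15 times" threshold
my $MIN_PO      = 100;   # hypothetical volume floor; tune against real data

# A sample only alarms when paging volume is non-trivial AND the ratio
# exceeds the limit; forcing fr to at least 1 keeps tiny fr values from
# producing absurd ratios.
sub alarming {
    my ($po, $fr) = @_;
    return 0 if $po < $MIN_PO;    # too little paging to matter
    $fr = 1 if $fr < 1;
    return $po / $fr > $RATIO_LIMIT ? 1 : 0;
}
```

With this, po=1, fr=0 no longer fires (its ratio is huge but its volume is trivial), while po=300, fr=15 (ratio 20) still does.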

Sorry for misleading you, but my initial example is bogus - I merely tried to illustrate the problem I had in requiring the zero-values to be significant in the ratio.

The real-life alert is called the Thrashing Severity Ratio, and is for a po:fr ratio = 1/6 (17%). This is described by Tom Farwell in a writeup of paging spaces, and may be somewhat specific to IBM's AIX.

My problem with that writeup is two-fold:

1. Periods of inactivity (0,0 values) are not given enough weight (this may lead to false positives).
2. The overall volume (po = 4k pages swapped to paging space) is not considered, and low-volume spikes may produce additional false positives (but NOT if sustained).

I should perhaps have mentioned the actual problem from the start, but I fear the downvote of Monks who feel that this discussion is not close enough to a pure perl problem!

Niel

Re: A lesson in statistics
by eric256 (Parson) on Mar 19, 2007 at 23:45 UTC

I don't know much statistics, but it seemed like the average of the last x samples would do what you want. This code behaves the way I interpreted your requirement; changing the number of samples changes how much data it holds to average over, which is a matter of preference on your part. I would think that if you don't restrict the samples you'll just end up with gibberish, assuming you are sampling some source for this data on a regular basis. If that is the case, then this will tell you if at any point the last x samples averaged over 15:1.

```
use strict;
use warnings;

my @que    = ();
my $sample = 5;

# Average of the per-sample po/fr ratios; a sample with fr == 0
# contributes a ratio of 0.
sub average_ratio {
    my @data  = @_;
    my $ratio = 0;

    for (@data) {
        $ratio += $_->[1] != 0 ? $_->[0] / $_->[1] : 0;
    }
    return $ratio / scalar @data;
}

while (my $line = <DATA>) {
    chomp $line;
    my ($po, $fr) = split /\s+/, $line;
    push @que, [ $po, $fr ];
    shift @que if @que > $sample;    # keep only the last $sample entries
    my $avg = average_ratio(@que);
    print "Adding [$po,\t $fr]\t makes the avg_ratio: $avg\n";
    print "DANGER\n" if $avg > 15;
}

__DATA__
0 0
0 0
150 10
0 0
200 40
210 40
220 40
220 30
0 0
0 0
0 0
220 20
220 10
220 05
220 01
220 100
220 100
2200 100
2200 2
2200 2
0 0
2200 1
0 0
0 0
0 0
0 0
0 0
0 0
```

___________
Eric Hodges
Hi Eric,

Thanks for your nice example, it matches what I currently deem the best solution (with input from others here) - calculating the mean of the ratio.

I'm not sure if there is a way to eliminate more false positives... but using this method a single spike will at least not cause an exception.

Niel

Re: A lesson in statistics
by osunderdog (Deacon) on Mar 19, 2007 at 21:39 UTC

Perhaps something like this would work?

```
use strict;
use warnings;
use Statistics::Descriptive;

my $uwlRatio = 15;

my $poStat = Statistics::Descriptive::Sparse->new();
my $frStat = Statistics::Descriptive::Sparse->new();

while (my $line = <DATA>) {
    chomp $line;
    my ($poData, $frData) = split /\s+/, $line;

    # accumulate the running statistics
    $poStat->add_data($poData);
    $frStat->add_data($frData);

    if ($frStat->mean() > 0) {    # guard against division by zero
        my $pofrRatioMean = $poStat->mean() / $frStat->mean();
        if ($pofrRatioMean > $uwlRatio) {
            print "DANGER WILL ROBINSON! PO/FR ratio out of spec!\n";
        }
        else {
            print "PO/Fr ratio within spec: $pofrRatioMean\n";
        }
    }
    else {
        print "not enough data to calculate ratio.\n";
    }
}

__DATA__
0 0
0 0
150 10
0 0
200 40
210 40
220 40
220 30
220 20
220 10
220 05
220 01
220 100
220 100
2200 100
2200 2
2200 2
2200 1
```

With output like this:

```
$ perl example.pl
not enough data to calculate ratio.
not enough data to calculate ratio.
PO/Fr ratio within spec: 15
PO/Fr ratio within spec: 15
PO/Fr ratio within spec: 7
PO/Fr ratio within spec: 6.22222222222222
PO/Fr ratio within spec: 6
PO/Fr ratio within spec: 6.25
PO/Fr ratio within spec: 6.77777777777778
PO/Fr ratio within spec: 7.57894736842105
PO/Fr ratio within spec: 8.51282051282051
PO/Fr ratio within spec: 9.59183673469388
PO/Fr ratio within spec: 7.09459459459459
PO/Fr ratio within spec: 5.85858585858586
PO/Fr ratio within spec: 9.11290322580645
PO/Fr ratio within spec: 13.4939759036145
DANGER WILL ROBINSON! PO/FR ratio out of spec!
DANGER WILL ROBINSON! PO/FR ratio out of spec!
```

Hazah! I'm Employed!

Well no, since it does not give any weight to the zero-value samples. The zeroes equate to idleness, and should negate any quick spikes in activity.

Thanks for pointing out Statistics::Descriptive though!

Niel

Umm, I'm pretty sure that's what average or mean does...

The arithmetic mean, or mean, of a set of measurements is the sum of the measurements divided by the total number of measurements.

Further information can be found at: http://en.wikipedia.org/wiki/Arithmetic_mean

The samples that are zero are counted, thus affecting the denominator but not the numerator.

Hazah! I'm Employed!
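A quick sketch of that point, using the spike from the original post: zero samples enlarge the denominator of the mean without adding to the numerator, so they pull the average down.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Five samples, four of them idle.
my @with_zeros    = (0, 0, 150, 0, 0);
my @without_zeros = (150);

# The zeros are counted in the sample size, so the mean drops from
# 150 to 30 once the idle samples are included.
my $mean_with    = sum(@with_zeros)    / @with_zeros;      # 150/5 = 30
my $mean_without = sum(@without_zeros) / @without_zeros;   # 150/1 = 150
```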

Re: A lesson in statistics
by kyle (Abbot) on Mar 19, 2007 at 20:55 UTC

Would it help to remove every outlier from the original data set and compute after that?

I'd consider a high po:fr ratio a statistical certainty if the majority of samples (over a predetermined fixed period) shows a 15:1 or higher ratio. I have only been looking at this extreme case, but I don't think the outlier can simply be removed, since it is significant (it proves a real thing: that momentary thrashing is occurring), and calculating the ratio of the averages factors that in.

Niel
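The "majority of samples" criterion described above might be sketched like this; the per-sample vote test and the window handling are illustrative assumptions, not code from the thread:

```perl
use strict;
use warnings;

my $RATIO_LIMIT = 15;

# A sample "votes" for thrashing only if its own po:fr ratio exceeds
# the limit; idle (0,0) samples never vote.
sub votes {
    my ($po, $fr) = @_;
    return 0 if $po == 0;
    $fr = 1 if $fr < 1;
    return $po / $fr >= $RATIO_LIMIT ? 1 : 0;
}

# Alarm only when a majority of the samples in the window vote.
sub sustained_alarm {
    my @window = @_;    # list of [po, fr] pairs
    my $yes = grep { votes(@$_) } @window;
    return $yes > @window / 2 ? 1 : 0;
}
```

For the original series only one of five samples votes, so no alarm fires; a sustained run of high-ratio samples would trip it.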

Re: A lesson in statistics
by hangon (Deacon) on Mar 20, 2007 at 04:27 UTC

My statistics is a bit rusty, and tye has a better handle on what you're actually doing, but this might help. When working with statistics, you generally define a range of acceptable sample values, and any values outside of this range are discarded. These are called outliers. For example, below is your code modified to ignore the samples where po == 0, so they will not skew your results.

```
my $tlsamples = @series_po;           # = @series_fr

my $ok_samples = 0;
my $sum_po     = 0;
my $sum_fr     = 0;
for (my $i = 0; $i < $tlsamples; $i++) {

    # set up any conditions to skip outliers here
    next if $series_po[$i] == 0;

    # count & sum only good samples
    $ok_samples++;
    $sum_po += $series_po[$i];
    $sum_fr += $series_fr[$i];
}
return 0 if $ok_samples == 0;

$sum_fr = 1 if $sum_fr == 0;
my $avg_po = $sum_po / $ok_samples;   # = 150 / 1 = 150
my $avg_fr = $sum_fr / $ok_samples;   # = 10 / 1 = 10

$avg_fr = 1 if $avg_fr == 0;          # avoid div/0
my $pofr = $avg_po / $avg_fr;         # = 15
```

Node Type: perlquestion [id://605580]
Approved by Corion