http://qs321.pair.com?node_id=757042

ELISHEVA has asked for the wisdom of the Perl Monks concerning the following question:

Recently I was trying to profile a script and noticed that when I ran the script two or more times in sequence, the total time and the breakdown between the different start-up (perl) and actual execution functions seem to change with each run. On a short script the variance can be as much as 30-40% (e.g. ranging from 0.040s to 0.090s). Even on longer-running scripts the variance is often in the 10% range.

This variance happens even on a single-user machine with nothing but OS-related background processes running. Obviously, those processes can't just go away, so I presume this variance is simply a given of the profiling process.

However, this raises a question for me: how do I tell whether a code change really improves performance? If I take only one profiling result before and one after, I can't really tell whether the "improvement" is due to the code change or just an artifact of the background processes and resource sharing at the time of each profiling run.

Of course, I could do several profiling runs before and after and take averages. Is this something others do? Is there software designed to run a profile several times and calculate the statistics? And if we are going the statistics route, how many runs are needed to get a reliable result? Is an average really the best measure of "central tendency" for comparing before and after results (alternatives: median, mode, min, max)?

Many thanks in advance, beth

Re: What is the best way to compare profiling results before and after a code change?
by perrin (Chancellor) on Apr 11, 2009 at 20:12 UTC
    Profiling is not for checking performance; it's for finding bottlenecks. Comparing two profiling runs can tell you if you moved the bottleneck. For checking whether performance improved, you need a benchmark. Most benchmark tests run several times, throw away the lowest and highest outliers, and average the rest.
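
    A minimal sketch of that practice using the core Time::HiRes module (the workload sub and the run count here are made-up placeholders, not anything from the thread):

      use strict;
      use warnings;
      use Time::HiRes qw(gettimeofday tv_interval);

      # Time a sub N times, drop the fastest and slowest runs,
      # and average the rest (a trimmed mean).
      sub bench_trimmed {
          my ($code, $runs) = @_;
          my @times;
          for (1 .. $runs) {
              my $t0 = [gettimeofday];
              $code->();
              push @times, tv_interval($t0);
          }
          @times = sort { $a <=> $b } @times;
          shift @times;                    # discard fastest outlier
          pop @times;                      # discard slowest outlier
          my $sum = 0;
          $sum += $_ for @times;
          return $sum / @times;            # mean of the remaining runs
      }

      sub my_workload { my $x = 0; $x += $_ for 1 .. 100_000 }   # dummy workload

      printf "trimmed mean: %.4fs\n", bench_trimmed(\&my_workload, 10);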

      Thank you for the clarification of terminology and the description of benchmarking practice. I'm interested in how the variance caused by competition with the OS affects both profiling and benchmarking.

      The number of times function X is called is, of course, stable from profile run to profile run, but at least in my experience the ranking of function calls by time can vary greatly from run to run. For example, in one run a function that was called ~5000 times clocked at 0.016s, ranked 3rd, and consumed 10.8% of the time. In another run using the same data, that same function clocked at 0.003s, ranked 4th, and consumed 4.88% of the time. A function that consumes 11% of the time is a potential bottleneck; at 5%, I'm not so sure.

      Best, beth

        To deal with that, I usually run the thing I'm profiling a few times in order to average out the differences. It's not practical with really large programs, but when possible I'll do 10 runs while recording profile data. That tends to smooth things out.
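
        For instance (a sketch only; the thread doesn't name a profiler, so Devel::NYTProf here is just one possibility), you can loop the workload inside a single profiled process so that per-call timings average out:

          # save as e.g. wrapper.pl, then run:
          #   perl -d:NYTProf wrapper.pl && nytprofhtml
          use strict;
          use warnings;

          sub workload {                    # stand-in for the real code
              my $x = 0;
              $x += rand() for 1 .. 50_000;
              return $x;
          }

          workload() for 1 .. 10;           # 10 runs in one profile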
Re: What is the best way to compare profiling results before and after a code change?
by CountZero (Bishop) on Apr 11, 2009 at 19:23 UTC
    The average is probably not a good measure on its own; you will also have to look at the spread around the average (i.e. the variance). Checking whether the differences between your programs are significant (due to your changes to the program) or simply due to random background effects can be done by running an ANOVA test. The null hypothesis would then be that the variance between runs of the program before and after you changed it is not bigger than the variance within each run.
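
    A hand-rolled sketch of the F statistic behind a one-way ANOVA (the run times below are made up, and the resulting F still has to be compared against a critical value from an F table):

      use strict;
      use warnings;

      # One-way ANOVA F statistic: between-group variance divided by
      # within-group variance. Each argument is a ref to one group's run times.
      sub anova_f {
          my @groups = @_;
          my ($n_total, $grand_sum) = (0, 0);
          for my $g (@groups) {
              $n_total   += @$g;
              $grand_sum += $_ for @$g;
          }
          my $grand_mean = $grand_sum / $n_total;

          my ($ssb, $ssw) = (0, 0);
          for my $g (@groups) {
              my $m = 0;
              $m += $_ for @$g;
              $m /= @$g;
              $ssb += @$g * ($m - $grand_mean) ** 2;   # between groups
              $ssw += ($_ - $m) ** 2 for @$g;          # within groups
          }
          my $df_b = @groups - 1;
          my $df_w = $n_total - @groups;
          return (($ssb / $df_b) / ($ssw / $df_w), $df_b, $df_w);
      }

      my @before = (0.052, 0.048, 0.061, 0.055, 0.050);   # made-up data
      my @after  = (0.041, 0.044, 0.039, 0.046, 0.042);
      my ($f, $df_b, $df_w) = anova_f(\@before, \@after);
      printf "F = %.2f on (%d, %d) degrees of freedom\n", $f, $df_b, $df_w;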

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thank you for taking the time to answer my question. You make an excellent point about variance, but shouldn't the null hypothesis be that there is no (statistically significant) change in mean/median/mode? I'm not sure what you mean by "the variance within each run".

      Your idea of comparing the size of the variance before and after suggests a side effect that I hadn't thought of: changing the code can change the way the script competes with the operating system, so the variance itself may differ before and after. I can see that in certain real-time situations where timing really matters, you might want to profile that as well as average performance time. However, in my case the "usual" performance time, rather than the consistency of performance time, is the primary concern.

      As for using an ANOVA - that would only apply if the distribution of profiling results is normal. If the distribution is skewed or has overly thick or thin tails, then you would have to use other techniques to analyze the variance. Without knowing the distribution it is very hard to tell how many standard deviations (the square root of the variance) are needed to make the difference between the old and new mean statistically significant (for a normal distribution, roughly two are enough at the usual 5% level).
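
      One distribution-free option along those lines is a rank test; here is a minimal Mann-Whitney U sketch (the samples are made up, and the U value still has to be checked against a table or a normal approximation):

        use strict;
        use warnings;

        # Mann-Whitney U: compare two samples by rank rather than by mean,
        # so no normality assumption is required. Ties get averaged ranks.
        sub mann_whitney_u {
            my ($s1, $s2) = @_;
            my @all = ((map { [$_, 0] } @$s1), (map { [$_, 1] } @$s2));
            @all = sort { $a->[0] <=> $b->[0] } @all;

            my @rank;
            my $i = 0;
            while ($i < @all) {
                my $j = $i;
                $j++ while $j < $#all && $all[$j + 1][0] == $all[$i][0];
                my $avg = ($i + $j + 2) / 2;        # ranks are 1-based
                $rank[$_] = $avg for $i .. $j;
                $i = $j + 1;
            }

            my $r1 = 0;                             # rank sum of sample 1
            $r1 += $rank[$_] for grep { $all[$_][1] == 0 } 0 .. $#all;
            my $n1 = @$s1;
            return $r1 - $n1 * ($n1 + 1) / 2;       # U statistic
        }

        my @before = (0.052, 0.048, 0.061, 0.055, 0.050);   # made-up samples
        my @after  = (0.041, 0.044, 0.039, 0.046, 0.042);
        printf "U = %g\n", mann_whitney_u(\@before, \@after);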

      Best, beth

        What I mean is that each run of the same program will see a different outcome scattered around the mean. As you know, the variance (or the standard deviation, if you like) is a measure of the spread of the actual results around that mean. The mean and variance of different versions of the same program will tell you whether those versions are faster or not, but perhaps the spread between the versions is smaller than the spread within each version, and in that case the differences between the mean run times are not really significant.
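
        As a rough illustration of that comparison (an informal sketch with made-up run times; two combined standard errors corresponds roughly to the usual 5% cut-off under normality):

          use strict;
          use warnings;

          # Is the gap between the two versions' means large relative to
          # the run-to-run spread within each version?
          sub mean_and_se {
              my @x = @{ $_[0] };
              my $m = 0;
              $m += $_ for @x;
              $m /= @x;
              my $var = 0;
              $var += ($_ - $m) ** 2 for @x;
              $var /= (@x - 1);                    # sample variance
              return ($m, sqrt($var / @x));        # mean, standard error
          }

          my @v1 = (0.052, 0.048, 0.061, 0.055, 0.050);   # made-up times
          my @v2 = (0.041, 0.044, 0.039, 0.046, 0.042);
          my ($m1, $se1) = mean_and_se(\@v1);
          my ($m2, $se2) = mean_and_se(\@v2);
          my $gap = abs($m1 - $m2);
          my $se  = sqrt($se1 ** 2 + $se2 ** 2);   # combined standard error
          printf "gap %.4fs vs 2 SE %.4fs => %s\n", $gap, 2 * $se,
              $gap > 2 * $se ? "probably a real difference" : "within the noise";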

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James