Not everything that can be counted counts and not everything that counts can be counted

-- William Bruce Cameron

What's measured improves

-- Peter Drucker

Three recent events got me thinking about software metrics again:

Management use individual KPIs to reward high performers in Sales and other departments. They are contemplating doing the same for Software Developers.
Performance Appraisals often seem subjective. Would metrics make them more objective? Or do more harm than good?
Larry Maccherone was in town recently promoting his company's approach to Agile metrics.

I'm interested to learn:

How does your company reward Software Developers? Are the rewards team-based, individual-based, department-based, whole-company based? How well does it work?
Do you have Performance Appraisals? Do they use metrics? Do your Software Developers/Teams have KPIs?
Do you use metrics to improve your Software Development Process?

I've done a bit of basic research on these topics, which I present below.

Software Metric Gaming

Key performance indicators can also lead to perverse incentives and unintended consequences as a result of employees working to the specific measurements at the expense of the actual quality or value of their work. For example, measuring the productivity of a software development team in terms of source lines of code encourages copy and paste code and over-engineered design, leading to bloated code bases that are particularly difficult to maintain, understand and modify.

-- Performance Indicator (wikipedia)

"Thank you for calling Amazon.com, may I help you?" Then -- Click! You're cut off. That's annoying. You just waited 10 minutes to get through to a human and you mysteriously got disconnected right away. Or is it mysterious? According to Mike Daisey, Amazon rated their customer service representatives based on the number of calls taken per hour. The best way to get your performance rating up was to hang up on customers, thus increasing the number of calls you can take every hour.

Software organizations tend to reward programmers who (a) write lots of code and (b) fix lots of bugs. The best way to get ahead in an organization like this is to check in lots of buggy code and fix it all, rather than taking the extra time to get it right in the first place. When you try to fix this problem by penalizing programmers for creating bugs, you create a perverse incentive for them to hide their bugs or not tell the testers about new code they wrote in hopes that fewer bugs will be found. You can't win.

Don't take my word for it, read Austin's book and you'll understand why this measurement dysfunction is inevitable when you can't completely supervise workers (which is almost always).

-- Joel Spolsky on Measurement

The anecdotes above are just the tip of the iceberg. I've heard many stories over the years of harmful gaming of metrics. It is clear that you should not introduce metrics lightly. It seems best to either:

Define metrics that cannot be effectively gamed; or
Win people's trust that metrics are being used solely to improve company performance and will not be used against anyone.

Suggestions on how to achieve this are welcome.

Performance Appraisals

At a recent Agile metrics panel discussion, I was a bit surprised that everyone agreed that their teams had some "rock stars" and some "bad apples". And that "everyone knew who they were". And that you didn't need metrics to know!

That's been my experience too. I've found that by being an active member of the team, you don't need to rely on numbers; you can simply observe how your teammates perform day to day. Combine that with regular one-on-ones plus 360-reviews from their peers and customers, and it is obvious who the high performers are and who needs improvement.

Though I personally feel confident with this process, I admit that it is subjective. I have seen cases where two different team leads have given markedly different scores to the same individual. Of course, these scores are at different times and for different projects. Still, personality compatibility (or conflict) between the team lead and team member can make a significant difference to the review score. It does seem unfair and subjective. Can metrics be used to make the performance appraisal process more objective? My feeling is that it would do more harm than good, as indicated in the "Software Metric Gaming" section above. What do you think?

Software Development Process Metrics

Lean-Agile City runs on folklore, intuition, and anecdotes

-- Larry Maccherone (slide 2 of "The Impact of Agile Quantified")

It's exceptionally difficult to measure software developer productivity, for all sorts of famous reasons. And it's even harder to perform anything resembling a valid scientific experiment in software development. You can't have the same team do the same project twice; a bunch of stuff changes the second time around. You can't have two teams do the same project; it's too hard to control all the variables, and it's prohibitively expensive to try it in any case. The same team doing two different projects in a row isn't an experiment either. About the best you can do is gather statistical data across a lot of teams doing a lot of projects, and try to identify similarities, and perform some regressions, and hope you find some meaningful correlations.

But where does the data come from? Companies aren't going to give you their internal data, if they even keep that kind of thing around. Most don't; they cover up their schedule failures and they move on, ever optimistic.

-- Good Agile, Bad Agile by Steve Yegge

As pointed out by Yegge above, software metrics are indeed a slippery problem. Especially problematic is getting your hands on a high quality, statistically significant data set.

The findings in this document were extracted by looking at non-attributable data from 9,629 teams

-- The Impact of Agile Quantified by Larry Maccherone

Larry Maccherone was able to solve Yegge's dataset problem by mining non-attributable data from many different teams, in many different organisations, from many different countries. While I found Larry's results interesting and useful, this remains a slippery problem because each team is different and unique.

Each project's ecosystem is unique. In principle, it should be impossible to say anything concrete and substantive about all teams' ecosystems. It is. Only the people on the team can deduce and decide what will work in that particular environment and tune the environment to support them.

-- Communicating, cooperating teams by Alistair Cockburn

By all means learn from Maccherone's overall results. But also think for yourself. Reason about whether each statistical correlation applies to your team's specific context. And Larry strongly cautions against leaping to conclusions about root causes.

Correlation does not necessarily mean Causation

The findings in this document are extracted by looking for correlation between “decisions” or behaviors (keeping teams stable, setting your team sizes to between 5 and 9, keeping your Work in Process (WiP) low, etc.) and outcomes as measured by the dimensions of the SDPI. As long as the correlations meet certain statistical requirements we report them here. However, correlation does not necessarily mean causation. For example, just because we show that teams with low average WiP have 1/4 as many defects as teams with high WiP, doesn’t necessarily mean that if you lower your WiP, you’ll reduce your defect density to 1/4 of what it is now. The effect may be partially or wholly related to some other underlying mechanism.

-- The Impact of Agile Quantified by Larry Maccherone
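
To make that caveat concrete, here is a toy simulation in Python. Everything in it is invented for illustration (the hidden "discipline" factor, the coefficients, the function names); it is not Maccherone's data or method. A hidden factor drives both WiP and defects, and WiP has no direct effect on defects at all, yet the two still correlate strongly.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_teams(n=500, seed=42):
    """Return (wip, defects) pairs for n hypothetical teams."""
    rng = random.Random(seed)
    teams = []
    for _ in range(n):
        # Hidden factor (an assumption): 0 = chaotic, 1 = disciplined.
        discipline = rng.random()
        wip = 10 - 8 * discipline + rng.gauss(0, 1)
        # Defects depend only on the hidden factor; there is no WiP term.
        defects = 20 - 15 * discipline + rng.gauss(0, 2)
        teams.append((wip, defects))
    return teams

teams = simulate_teams()
r = pearson([w for w, _ in teams], [d for _, d in teams])
print(f"correlation between WiP and defects: {r:.2f}")
```

In this made-up world, forcing a team's WiP down would not change its defect count, because defects depend only on the hidden factor. Real teams are not this simple, which is exactly why the caution matters.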

"Best Practices"

There are no best practices. Only good practices in context.

-- Seven Deadly Sins of Agile Measurement by Larry Maccherone

I've long found the "Best Practice" meme puzzling. After all, it is impossible to prove that you have truly found the "best" practice. So I welcomed Maccherone's opening piece of advice: the best you can hope for in a complex, empirical process such as Software Development is a good process for a given context, one which you should always be seeking to improve.

Common examples of "context" are business and economic drivers. If your business demands very high quality, for example, your "best practice" may well be four-week iterations, while if higher productivity is more important than quality, your "best practice" may be one-week sprints instead (see the "Impact of Agile Quantified Summary of Results" section below for iteration length metrics).

Team vs Individual Metrics

From the blog cited by Athanasius:

(From US baseball): In short, players play to the metrics their management values, even at the cost of the team.

Yes, Larry Maccherone mentioned a similar anecdote from US basketball, where a star player had a very high individual scoring percentage ... yet statistics showed that the team actually won more often when the star player was not playing! Larry felt this was because he often took low percentage shots to boost his individual score rather than pass to a player in a better position to score.

Finding the Right Metrics

More interesting quotes from this blog:

The same happens in workplaces. Measure YouTube views? Your employees will strive for more and more views. Measure downloads of a product? You’ll get more of that. But if your actual goal is to boost sales or acquire members, better measures might be return-on-investment (ROI), on-site conversion, or retention. Do people who download the product keep using it, or share it with others? If not, all the downloads in the world won’t help your business.

In the business world, we talk about the difference between vanity metrics and meaningful metrics. Vanity metrics are like dandelions – they might look pretty, but to most of us, they're weeds, using up resources, and doing nothing for your property value. Vanity metrics for your organization might include website visitors per month, Twitter followers, Facebook fans, and media impressions. Here's the thing: if these numbers go up, it might drive up sales of your product. But can you prove it? If yes, great. Measure away. But if you can't, they aren't valuable.

Good metrics have three key attributes: their data are consistent, cheap, and quick to collect. A simple rule of thumb: if you can't measure results within a week for free (and if you can't replicate the process), then you’re prioritizing the wrong ones.

Good data scientists know that analyzing the data is the easy part. The hard part is deciding what data matters.

Schwaber recommends measuring:

Cycle time - quickest time to get one feature out
Release cycle - time to get a release out
Defects - change in defects
Productivity - normalized effort to get a unit of functionality "done"
Stabilization - after code complete, the % of a release spent stabilizing before release
Customer satisfaction - up or down
Employee satisfaction - up or down

Agile Measurement Checklists

Larry Maccherone's Seven Deadly Sins (and Heavenly Virtues) of Agile Measurement:

Sin: Using metrics as levers to change someone else's behaviour. Virtue: Use metrics for feedback to improve your own performance.
Sin: Unbalanced metrics. Virtue: From day one, have one metric from each quadrant. The quadrants are: Productivity (Do it fast); Quality (Do it right); Predictability (Do it on time); Employee Satisfaction (Keep doing it).
Sin: Believing metrics can replace thinking. Virtue: Use quantitative insight to complement rather than replace qualitative insight.
Sin: Too-costly metrics. Virtue: Favour automatic metrics from passively acquired data or lightweight surveys.
Sin: Using a lazy/convenient metric. Virtue: Use ODIM (Outcome/Decision/Insight/Measurement) to determine metrics that provide critical insight and drive your desired outcomes.
Sin: Bad analysis. Virtue: Get your statistics right by consulting experts.
Sin: Forecasting without discussing probability and risk. Virtue: Use the percentile coverage distribution, the cone of uncertainty, or Monte Carlo simulation.
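
The last virtue can be illustrated with a small Monte Carlo forecast. This is a rough Python sketch under stated assumptions, not Maccherone's method: the weekly throughput history, the backlog size and the function names are all invented, and it assumes past weeks are representative of future ones. It resamples past weekly throughput many times and reports percentiles instead of a single date.

```python
import random

def forecast_weeks(history, backlog, trials=10_000, seed=1):
    """Simulate the weeks needed to finish `backlog` items by resampling
    past weekly throughput. Returns the sorted list of simulated outcomes."""
    if not history or min(history) < 0 or max(history) == 0:
        raise ValueError("history needs non-negative values and some completed work")
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= rng.choice(history)   # pick one past week at random
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    return outcomes

def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    rank = (p * len(sorted_values) + 99) // 100   # ceiling of p * n / 100
    return sorted_values[rank - 1]

# Invented example: six weeks of history, 40 items left in the backlog.
history = [3, 5, 4, 6, 2, 5]
outcomes = forecast_weeks(history, backlog=40)
for p in (50, 85, 95):
    print(f"{p}% of simulated runs finished within {percentile(outcomes, p)} weeks")
```

Quoting the 85th or 95th percentile alongside the median makes the risk explicit, rather than hiding it behind one number.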

Hank Marquis's Seven Dirty Little Truths About Metrics cautions that metrics must derive from, and align with, business goals and strategies, and that metrics should be selected only after understanding the needs they address.

What gets measured is what gets done.
Metrics drive both good AND bad behaviour.
Failure to align with Vital Business Functions (VBF, e.g. Revenue impact, data security) can lead you astray.
Metrics do not get better with age -- they often become obsolete.
The real purpose of metrics is to help you make better decisions.
Effective metrics do not measure people -- they measure teams and processes.
Good metrics help optimize the performance of the whole organization.

Further advice from Hank:

Align with Vital Business Functions. Regardless of the IT activity, you need to make sure your metrics tell you something about the VBF that depends on what you are measuring.
Keep it simple. A common problem-manager fault is overloading a metric, that is, trying to get a single metric to report more than one thing. If you want to track more than one thing, create a metric for each. Keep the metric simple and easy to understand. If it is too hard to determine the metrics, people often fake the data or the entire report.
Good enough is perfect. Do not waste time polishing your metrics. Instead, select metrics that are easy to track, and easy to understand.
Use metrics as indicators. A KPI does not troubleshoot anything, but rather the KPI indicates something is amiss.
A few good metrics. Too many metrics, even if they are effective, can overwhelm a team. Use three to six.
Beware the trap of metrics. Failure to follow these guidelines invariably results in process problems.

Impact of Agile Quantified Summary of Results

Maccherone's results were reported with regard to the following four dimensions of performance:

Responsiveness. Based on Time in Process (or Time to Market). The amount of time that a work item spends in process.
Quality. Based on defect density. The count of defects divided by man days.
Productivity. Based on Throughput/Team Size. The count of user stories and defects completed in a given time period.
Predictability. Based on throughput variability. The standard deviation of throughput for a given team over 3 monthly periods divided by the average of the throughput for those same 3 months.
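
These four definitions reduce to simple arithmetic. A minimal Python sketch follows, assuming invented figures and hypothetical function names (none of this comes from Maccherone's tooling). The source does not say whether the standard deviation is the sample or the population form; the sample form is assumed here.

```python
from datetime import date
from statistics import mean, stdev

def time_in_process(started, finished):
    """Responsiveness: days a work item spent in process."""
    return (finished - started).days

def defect_density(defects, man_days):
    """Quality: count of defects divided by man days."""
    return defects / man_days

def productivity(throughput, team_size):
    """Productivity: stories and defects completed per team member."""
    return throughput / team_size

def throughput_variability(monthly_throughput):
    """Predictability: standard deviation of throughput divided by its
    average. Uses the sample standard deviation (an assumption)."""
    return stdev(monthly_throughput) / mean(monthly_throughput)

# Invented figures for one six-person team over three months.
monthly_throughput = [18, 24, 30]
print("time in process:", time_in_process(date(2014, 7, 1), date(2014, 7, 9)), "days")
print("defect density:", defect_density(defects=6, man_days=120))
print("productivity:", productivity(throughput=24, team_size=6))
print("throughput variability:", throughput_variability(monthly_throughput))
```

Lower is better for time in process, defect density and throughput variability; higher is better for productivity.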

Three further "fuzzier" metrics (often measured via lightweight surveys) are currently under development, namely:

Customer Satisfaction.
Employee Engagement.
Build-the-right-thing.

Stable teams resulted in:

60% better productivity
40% better predictability
60% better responsiveness

Recommendations:

Dedicate people to a single team
Keep teams intact and stable

If people are dedicated to only one team rather than multiple teams or projects, they stay focused and get more done, leading to better performance.

Estimating:

No Estimates: 3%
Full Scrum. Story points and task hours: 79%
Lightweight Scrum. Story points only: 10%
Hour-oriented. Task hours only: 8%

Teams doing Full Scrum have 250% better Quality than teams doing no estimating.
Lightweight Scrum performs better overall, with better Productivity, Predictability and Responsiveness.

Recommendations:

Experienced teams may get best results from Lightweight Scrum.
If new to Agile, or focused most strongly on Quality, choose Full Scrum.

Work in Process (WiP) is the number of work items that are "in process" at the same time.
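
As a rough illustration of that definition, WiP on a given day can be counted from each work item's start and finish dates. The Python sketch below uses invented data and a hypothetical function name. It also assumes that an item finished on a given day no longer counts as in process on that day.

```python
from datetime import date

def wip_on(day, items):
    """Count items started on or before `day` and not yet finished.
    Each item is a (start, finish) pair; finish is None if still open."""
    return sum(
        1
        for start, finish in items
        if start <= day and (finish is None or finish > day)
    )

# Invented work items: (start date, finish date or None).
items = [
    (date(2014, 7, 1), date(2014, 7, 4)),
    (date(2014, 7, 2), date(2014, 7, 3)),
    (date(2014, 7, 3), None),
]
for day in (date(2014, 7, 2), date(2014, 7, 4)):
    print(day.isoformat(), "WiP =", wip_on(day, items))
```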

Teams that aggressively control WiP:

Cut time in process in half
Have 1/4 as many defects
But have 34% lower Productivity

Recommendations:

If your WiP is high, reduce it
If your WiP is already low, consider your economic drivers: if productivity drives your bottom line, don't push WiP too low; if time to market drives your bottom line, push WiP as low as it will go

Small teams (of 1-3 people) have:

17% lower Quality
But 17% more Productivity

than teams of the recommended size (5-9 people).

Recommendations:

Set up team size of 5-9 people for the most balanced performance
If you are doing well with larger teams, there's no evidence that you need to change

Iteration Length:

Teams using two-week iterations have the best balanced performance.
Longer iterations correlate with higher Quality.
Shorter iterations correlate with higher Productivity and Responsiveness.

Testers:

More testers lead to better Quality.
But they also generally lead to worse Productivity and Responsiveness.
Interestingly, teams that self-identify as having no testers have: the best Productivity; almost as good Quality; but much wider variation in Quality.

Motivation:

Motive has a small but statistically significant impact on performance.
Extrinsic motivation does not have a negative impact on performance.
Executive support is critical for success with Agile.
Teamwork is not the dominant factor; talent, skills, and experience are.
Those motivated by Quality perform best.

Co-location:

Teams located within the same time zone have up to 25% better productivity.

References

The Impact of Agile Quantified by Larry Maccherone (PDF)
The Impact of Agile Quantified by Larry Maccherone (slideshare)
Seven Deadly Sins of Agile Measurement by Larry Maccherone (PDF)
Agile practices: what's folklore, what's quantifiable?
Seven Dirty Little Truths About Metrics by Hank Marquis
Good Agile, Bad Agile by Steve Yegge
Joel Spolsky: Measurement
Performance Indicator (wikipedia)
SMART (wikipedia)

Measuring and Managing Performance in Organizations book by Robert D Austin
Kanban and Scrum making the most of both by Henrik Kniberg & Mattias Skarin
Kanban book by David Anderson
SDLC 3.0 Beyond a Tacit Understanding of Agile book by Mark Kennaley
Process Dynamics, Modeling, and Control book by Ogunnaike and Ray
Moneyball: The Art of Winning an Unfair Game book by Michael Lewis

References Added Later

Measuring programmer quality by deorth (2007)

Update: Added new sub-sections to "Summary of Results" section: Iteration length; Testers; Motivation; Co-location. 23-July-2014 Update: Added new sections: Team vs Individual Metrics, Finding the Right Metrics; 23-Nov-2014: Added Schwaber-recommended metrics.

In reply to Nobody Expects the Agile Imposition (Part VII): Metrics by eyepopslikeamosquito
